Flight delays are a significant societal problem as they impair airlines, transport companies, air traffic controllers, facility managers, and passengers. Studying flight data is an essential activity for every player involved in the air transportation system. Given the uncertainty of delays, passengers usually plan to travel many hours earlier for their appointments. They might have to increase their travel costs to arrive on time. Airlines suffer penalties, fines, and additional operation costs, such as airport crew and aircraft retention. Regarding sustainability, delays may also increase environmental issues due to increased fuel consumption and gas emissions (Rebollo and Balakrishnan 2014) (Sternberg et al., 2017); (Carvalho et al. 2021).
This project aims to generate a prediction model of departure delay time. A reliable prediction will benefit airports, airlines, passengers, and global environmental sustainability.
Our analysis is based on the public flights dataset taken from the “nycflights13” R package. This package contains information about all flights that departed from NYC in 2013. We used additional datasets from the same package, such as ‘planes’, ‘weather’, and ‘airport’. These datasets provide extensive information about the flights. ‘planes’ dataset contains construction information about each plane and was merged with the ‘flights’ dataset by the common variable column “tailnum”. ‘weather’ dataset, which includes hourly meteorological data for each airport, was merged with ‘flights’ according to the “origin” and “time_hour” variable columns. Similarly, the ‘airports’ dataset, which represents all involved airports and their locations, was merged with ‘flights’ by “faa”(=“dest”) variable column. The merged dataset contains 47 variables and 276688 observations.
#Load packages:
options(repos = list(CRAN="http://cran.rstudio.com/"))
list.of.packages <-
c(
"dataPreparation",
"nycflights13",
"dplyr",
"tidyverse",
"RColorBrewer",
"ggplot2",
"lubridate",
"ROSE",
"ranger",
"gridExtra",
"ranger",
"ROSE",
"caret",
"rpart",
"rpart.plot",
"rattle",
"Metrics",
"mlr",
"plotly",
"caTools",
"psych"
)
new.packages <-
list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) {
install.packages(new.packages)
}
library(dataPreparation)
library(nycflights13)
library(dplyr)
library(tidyverse)
library(RColorBrewer)
library(ggplot2)
library(lubridate)
library(ROSE)
library(ranger)
library(gridExtra)
library(ranger)
library(ROSE)
library(caret)
library(rpart)
library(rattle)
library(rpart.plot)
library(Metrics)
library(mlr)
library(plotly)
library(caTools)
library(psych)
flights <- nycflights13::flights
airports <- nycflights13::airports
planes <- nycflights13::planes
weather <- nycflights13::weather
#merged flights + weather + planes + airports(tzone)
flights_planes <- merge(flights, planes, by = "tailnum")
flights_planes_weather <-
merge(flights_planes, weather, by = c("origin", "time_hour"))
flights_planes_weather <-
flights_planes %>% right_join(weather, by = c("origin", "time_hour"))
flights_full <-
merge(flights_planes_weather,
airports,
by.x = c("dest"),
by.y = c("faa"))
dim(flights_full)
## [1] 276688 47
head(flights_full)
## dest tailnum year.x month.x day.x dep_time sched_dep_time dep_delay arr_time
## 1 ABQ N637JB 2013 7 9 2004 2007 -3 2247
## 2 ABQ N566JB 2013 12 6 2008 2001 7 2350
## 3 ABQ N563JB 2013 11 25 2014 2000 14 2300
## 4 ABQ N625JB 2013 9 20 2000 2001 -1 2245
## 5 ABQ N547JB 2013 8 18 2006 2007 -1 2226
## 6 ABQ N712JB 2013 9 7 1958 2001 -3 2205
## sched_arr_time arr_delay carrier flight origin air_time distance hour.x
## 1 2259 -12 B6 1505 JFK 227 1826 20
## 2 2304 46 B6 65 JFK 304 1826 20
## 3 2303 -3 B6 65 JFK 263 1826 20
## 4 2248 -3 B6 65 JFK 250 1826 20
## 5 2259 -33 B6 1505 JFK 235 1826 20
## 6 2248 -43 B6 65 JFK 221 1826 20
## minute time_hour year.y type manufacturer
## 1 7 2013-07-09 20:00:00 2006 Fixed wing multi engine AIRBUS
## 2 1 2013-12-06 20:00:00 2003 Fixed wing multi engine AIRBUS
## 3 0 2013-11-25 20:00:00 2003 Fixed wing multi engine AIRBUS
## 4 1 2013-09-20 20:00:00 2005 Fixed wing multi engine AIRBUS
## 5 7 2013-08-18 20:00:00 2002 Fixed wing multi engine AIRBUS
## 6 1 2013-09-07 20:00:00 2008 Fixed wing multi engine AIRBUS
## model engines seats speed engine year month.y day.y hour.y temp dewp
## 1 A320-232 2 200 NA Turbo-fan 2013 7 9 20 78.08 71.96
## 2 A320-232 2 200 NA Turbo-fan 2013 12 6 20 39.92 35.96
## 3 A320-232 2 200 NA Turbo-fan 2013 11 25 20 33.08 12.92
## 4 A320-232 2 200 NA Turbo-fan 2013 9 20 20 66.02 59.00
## 5 A320-232 2 200 NA Turbo-fan 2013 8 18 20 69.98 60.08
## 6 A320-232 2 200 NA Turbo-fan 2013 9 7 20 69.98 57.92
## humid wind_dir wind_speed wind_gust precip pressure visib
## 1 81.50 200 8.05546 NA 0.00 1016.1 10
## 2 85.61 20 18.41248 NA 0.03 1020.5 5
## 3 42.84 230 16.11092 NA 0.00 1028.9 10
## 4 78.08 170 8.05546 NA 0.00 1016.8 10
## 5 70.81 90 3.45234 NA 0.00 1020.9 10
## 6 65.54 200 14.96014 NA 0.00 1014.1 10
## name lat lon alt tz dst
## 1 Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## 2 Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## 3 Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## 4 Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## 5 Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## 6 Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## tzone
## 1 America/Denver
## 2 America/Denver
## 3 America/Denver
## 4 America/Denver
## 5 America/Denver
## 6 America/Denver
In order to successfully predict a delayed flight we would like to use the most affecting features/variables in our model and avoid features that may increase noise. Achieving this goal requires pre-processing steps in which we reduce the number of variables and the number of levels/categories in variables.
First, we plotted numeric variables vs. departure delay in order to visually spot if there is a linear relationship between them. On top of the scatter plot there is a linear model fit line:
plot_var_vs_dep_delay <-
function(var_name, df = numeric_flights_full){
df <- df[sample(nrow(df), 10000), ]
p<-ggplot(data = df, mapping = aes(x = get(var_name), y = dep_delay)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm",level=0.2) +
labs(
x = sprintf("%s", var_name),
y = "Departure delay (min)",
title = sprintf("%s vs departure delay", var_name))
return(p)
}
numeric_vars <- c(
'sched_dep_time',
'dep_delay',
'arr_delay',
'distance',
'air_time',
'temp',
'wind_dir',
'wind_speed',
'wind_gust',
"precip",
"pressure",
"visib",
"humid",
"dewp"
)
numeric_flights_full <- flights_full %>% select(all_of(numeric_vars))
plot_list <- lapply(numeric_vars, plot_var_vs_dep_delay)
grid.arrange(grobs = plot_list, ncol = 3)
We would like to examine a correlation between numeric variables. In order to calculate Pearson correlation, certain assumptions must be met, one of them is that there must be a linear relationship between the variables. Observing the scatter plots above, we can see there is no linear relationship between those numeric variables. Thus, we chose to apply Spearman correlation test, which does not require linear relationship assumptions.
Spearman correlation test between numerical variables in the datase:
cor_test_mat <- corr.test(numeric_flights_full, method = "spearman")
#r
cor_test_mat$r
## sched_dep_time dep_delay arr_delay distance
## sched_dep_time 1.000000000 0.242468155 0.15972763 -0.020968959
## dep_delay 0.242468155 1.000000000 0.63416915 0.076474466
## arr_delay 0.159727631 0.634169152 1.00000000 -0.071903136
## distance -0.020968959 0.076474466 -0.07190314 1.000000000
## air_time -0.021874600 0.079279091 -0.02003161 0.984448446
## temp 0.078878376 0.046728973 -0.02567579 0.006721989
## wind_dir 0.009809660 -0.009695863 -0.01368203 0.001601394
## wind_speed 0.123476957 0.065576107 0.07031103 0.019372379
## wind_gust 0.050621861 0.036945207 0.06842991 0.030620264
## precip 0.009062127 0.116261731 0.15317115 0.003132605
## pressure -0.074564174 -0.113104964 -0.12835539 0.008417160
## visib 0.082449488 -0.082294763 -0.12681222 -0.009263453
## humid -0.159960170 0.086862748 0.11285397 0.023404245
## dewp -0.008916716 0.080997447 0.02851065 0.016728963
## air_time temp wind_dir wind_speed wind_gust
## sched_dep_time -0.021874600 0.078878376 0.009809660 0.12347696 0.05062186
## dep_delay 0.079279091 0.046728973 -0.009695863 0.06557611 0.03694521
## arr_delay -0.020031610 -0.025675795 -0.013682031 0.07031103 0.06842991
## distance 0.984448446 0.006721989 0.001601394 0.01937238 0.03062026
## air_time 1.000000000 -0.049792971 0.011814677 0.02951930 0.04791494
## temp -0.049792971 1.000000000 -0.144816796 -0.11791494 -0.35254088
## wind_dir 0.011814677 -0.144816796 1.000000000 0.34195792 0.13169708
## wind_speed 0.029519300 -0.117914944 0.341957922 1.00000000 0.87662980
## wind_gust 0.047914943 -0.352540879 0.131697082 0.87662980 1.00000000
## precip 0.019925058 -0.065799680 -0.083363231 0.04379486 0.08782057
## pressure -0.001528471 -0.239281149 -0.186149803 -0.20058037 -0.21646913
## visib -0.024508213 0.054669345 0.223136647 0.12964654 -0.10737224
## humid 0.028202000 0.045535829 -0.351682726 -0.21874997 -0.01924615
## dewp -0.030345953 0.881335226 -0.279280924 -0.19406358 -0.30536777
## precip pressure visib humid dewp
## sched_dep_time 0.009062127 -0.074564174 0.082449488 -0.15996017 -0.008916716
## dep_delay 0.116261731 -0.113104964 -0.082294763 0.08686275 0.080997447
## arr_delay 0.153171154 -0.128355394 -0.126812225 0.11285397 0.028510646
## distance 0.003132605 0.008417160 -0.009263453 0.02340424 0.016728963
## air_time 0.019925058 -0.001528471 -0.024508213 0.02820200 -0.030345953
## temp -0.065799680 -0.239281149 0.054669345 0.04553583 0.881335226
## wind_dir -0.083363231 -0.186149803 0.223136647 -0.35168273 -0.279280924
## wind_speed 0.043794856 -0.200580367 0.129646536 -0.21874997 -0.194063580
## wind_gust 0.087820566 -0.216469127 -0.107372244 -0.01924615 -0.305367772
## precip 1.000000000 -0.081572193 -0.480288221 0.37760322 0.102317758
## pressure -0.081572193 1.000000000 0.128824903 -0.13973204 -0.267869766
## visib -0.480288221 0.128824903 1.000000000 -0.56241399 -0.197036048
## humid 0.377603217 -0.139732043 -0.562413993 1.00000000 0.494198023
## dewp 0.102317758 -0.267869766 -0.197036048 0.49419802 1.000000000
#p value
cor_test_mat$p
## sched_dep_time dep_delay arr_delay distance
## sched_dep_time 0.000000e+00 0.000000e+00 0.000000e+00 5.412243e-27
## dep_delay 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## arr_delay 0.000000e+00 0.000000e+00 0.000000e+00 1.878316e-306
## distance 2.706121e-28 0.000000e+00 4.268899e-308 0.000000e+00
## air_time 4.124446e-30 0.000000e+00 1.621108e-25 0.000000e+00
## temp 0.000000e+00 1.439739e-131 7.639191e-41 4.065868e-04
## wind_dir 3.512590e-07 5.840768e-07 1.942815e-12 4.056914e-01
## wind_speed 0.000000e+00 2.529843e-257 1.365972e-294 2.203719e-24
## wind_gust 7.008941e-39 4.669118e-21 5.446890e-68 3.120584e-15
## precip 1.871206e-06 0.000000e+00 0.000000e+00 9.939732e-02
## pressure 1.573520e-300 0.000000e+00 0.000000e+00 2.953352e-05
## visib 0.000000e+00 0.000000e+00 0.000000e+00 1.100475e-06
## humid 0.000000e+00 0.000000e+00 0.000000e+00 7.787442e-35
## dewp 2.728860e-06 0.000000e+00 5.923647e-50 1.368616e-18
## air_time temp wind_dir wind_speed
## sched_dep_time 8.661337e-29 0.000000e+00 3.863849e-06 0.000000e+00
## dep_delay 0.000000e+00 5.039086e-130 5.840768e-06 1.011937e-255
## arr_delay 3.080106e-24 1.986190e-39 2.525659e-11 5.737081e-293
## distance 0.000000e+00 1.626347e-03 8.113828e-01 3.746322e-23
## air_time 0.000000e+00 4.464886e-147 1.461914e-08 5.964947e-52
## temp 1.240246e-148 0.000000e+00 0.000000e+00 0.000000e+00
## wind_dir 1.218262e-09 0.000000e+00 0.000000e+00 0.000000e+00
## wind_speed 2.056878e-53 0.000000e+00 0.000000e+00 0.000000e+00
## wind_gust 3.391162e-34 0.000000e+00 1.678106e-253 0.000000e+00
## precip 2.905854e-25 4.814770e-263 0.000000e+00 1.663411e-117
## pressure 4.515856e-01 0.000000e+00 0.000000e+00 0.000000e+00
## visib 2.283766e-37 4.125182e-182 0.000000e+00 0.000000e+00
## humid 6.461952e-49 6.597392e-127 0.000000e+00 0.000000e+00
## dewp 2.344350e-56 0.000000e+00 0.000000e+00 0.000000e+00
## wind_gust precip pressure visib
## sched_dep_time 1.752235e-37 1.309844e-05 6.766138e-299 0.000000e+00
## dep_delay 7.470589e-20 0.000000e+00 0.000000e+00 0.000000e+00
## arr_delay 1.688536e-66 0.000000e+00 0.000000e+00 0.000000e+00
## distance 4.368817e-14 2.981919e-01 1.476676e-04 8.803803e-06
## air_time 7.460556e-33 5.230537e-24 8.113828e-01 5.481038e-36
## temp 0.000000e+00 1.974056e-261 0.000000e+00 1.567569e-180
## wind_dir 6.544614e-252 0.000000e+00 0.000000e+00 0.000000e+00
## wind_speed 0.000000e+00 5.489257e-116 0.000000e+00 0.000000e+00
## wind_gust 0.000000e+00 3.903508e-112 0.000000e+00 1.238188e-167
## precip 1.219846e-113 0.000000e+00 0.000000e+00 0.000000e+00
## pressure 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## visib 3.346453e-169 0.000000e+00 0.000000e+00 0.000000e+00
## humid 7.196687e-07 0.000000e+00 0.000000e+00 0.000000e+00
## dewp 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## humid dewp
## sched_dep_time 0.000000e+00 1.637316e-05
## dep_delay 0.000000e+00 0.000000e+00
## arr_delay 0.000000e+00 1.658621e-48
## distance 1.791112e-33 2.052924e-17
## air_time 1.744727e-47 7.033051e-55
## temp 2.243113e-125 0.000000e+00
## wind_dir 0.000000e+00 0.000000e+00
## wind_speed 0.000000e+00 0.000000e+00
## wind_gust 6.477018e-06 0.000000e+00
## precip 0.000000e+00 0.000000e+00
## pressure 0.000000e+00 0.000000e+00
## visib 0.000000e+00 0.000000e+00
## humid 0.000000e+00 0.000000e+00
## dewp 0.000000e+00 0.000000e+00
Plot of correlation heatmap:
#plot the correlation matrix
cor.plot(cor_test_mat$r, main = "Correlation Matrix (spearman)", stars = TRUE, xlas = 2)
The variables ‘dewp’ and ‘temp’ are highly correlated (0.9), so are ‘wind_speed’ and ‘wind_gust’ (0.87), and ‘air_time’ and ‘distance’ (0.98).The correlation is significant (pval<0.05) We decided to remove ‘dewp’, ‘air_time’, and ‘wind_gust’ to avoid double variables in the model. All three variables are logically an outcome of the variable they are highly correlated to.
We edited some of the columns in our dataset in order to improve their potential contribution to the prediction:
Convert hours column to minutes
flights_full <-
flights_full %>% mutate(
sched_dep_time = hour.x * 60 + minute,
sched_arr_time = floor(sched_arr_time / 100) *
60 + sched_arr_time %% 100
)
Convert days to week days
flights_full <- flights_full %>%
mutate(w_day = wday(time_hour, label = TRUE))
Convert days to weeks (52 weeks per year)
flights_full <-
flights_full %>% mutate(week_num = (year(time_hour) - year(min(time_hour))) * 52 +
week(time_hour) - week(min(time_hour)))
Convert wind direction from degrees to 16 compass directions
directions <- read.csv('../output/wind_directions.csv')
flights_full <- flights_full %>%
mutate(wind_dir = cut(
as.numeric(wind_dir),
breaks = c(0, directions$degree_max, 360),
labels = c(directions$cardinal, 'N')
))
#note: when the wind direction is 0 degrees the wd_cardinal is NA.
#Also, the wind speed is 0.
#This is correct because if the wind is not moving then it does not have a direction.
Remove NA from dep_delay
flights_full <- flights_full %>% drop_na(dep_delay)
Remove sd outliers
flights_full <-
remove_sd_outlier(flights_full,
cols = "dep_delay",
n_sigmas = 7,
verbose = TRUE)
## [1] "remove_sd_outlier: I start to filter categorical rare events"
## [1] "remove_sd_outlier: dropped 548 row(s) that are rare event on dep_delay."
## [1] "remove_sd_outlier: 548 have been dropped. It took 0.2 seconds. "
Convert the classes of part of the columns
flights_full <- transform(
flights_full,
origin = as.factor(origin),
carrier = as.factor(carrier),
tzone = as.factor(tzone),
type = as.factor(type),
model = as.factor(model),
engine = as.factor(engine),
hour.y = as.numeric(hour.y),
hour.x = as.numeric(hour.x),
manufacturer = as.factor(manufacturer),
month.x = ordered(as.factor(month.x)),
month.y = ordered(as.factor(month.y)),
dest<- as.factor(dest)
)
Remove identical and constant variables
#identical columns
identical(flights_full$hour.x, flights_full$hour.y)
## [1] TRUE
identical(flights_full$month.x, flights_full$month.y)
## [1] TRUE
identical(flights_full$day.x, flights_full$day.y)
## [1] TRUE
#remove identical and constant columns
flights_full <- fast_filter_variables(
flights_full,
level = 2,
keep_cols = NULL,
verbose = TRUE
)
## [1] "fast_filter_variables: I check for constant columns."
## [1] "fast_filter_variables: I delete 2 constant column(s) in data_set."
## [1] "fast_filter_variables: I check for columns in double."
## [1] "fast_filter_variables: I delete 3 column(s) that are in double in data_set."
Let’s take a look at the data:
summary(flights_full)
## dest tailnum month.x day.x
## Length:271980 Length:271980 10 : 23965 Min. : 1.00
## Class :character Class :character 8 : 23847 1st Qu.: 8.00
## Mode :character Mode :character 5 : 23608 Median :16.00
## 7 : 23563 Mean :15.69
## 3 : 23106 3rd Qu.:23.00
## 4 : 23058 Max. :31.00
## (Other):130833
## dep_time sched_dep_time dep_delay arr_time
## Min. : 1 Min. : 300.0 Min. :-43.0 Min. : 1
## 1st Qu.: 909 1st Qu.: 545.0 1st Qu.: -5.0 1st Qu.:1103
## Median :1359 Median : 835.0 Median : -1.0 Median :1536
## Mean :1351 Mean : 814.4 Mean : 12.5 Mean :1506
## 3rd Qu.:1750 3rd Qu.:1049.0 3rd Qu.: 11.0 3rd Qu.:1945
## Max. :2400 Max. :1425.0 Max. :298.0 Max. :2400
## NA's :345
## sched_arr_time arr_delay carrier flight origin
## Min. : 1.0 Min. :-86.000 UA :55437 Min. : 1 EWR:110190
## 1st Qu.: 681.0 1st Qu.:-17.000 EV :51015 1st Qu.: 509 JFK: 88371
## Median : 955.0 Median : -5.000 B6 :49385 Median :1409 LGA: 73419
## Mean : 934.9 Mean : 6.376 DL :46007 Mean :1882
## 3rd Qu.:1188.0 3rd Qu.: 14.000 US :19758 3rd Qu.:3315
## Max. :1439.0 Max. :441.000 9E :17310 Max. :8500
## NA's :926 (Other):33068
## air_time distance hour.x minute
## Min. : 20.0 Min. : 80 Min. : 5.00 Min. : 0.00
## 1st Qu.: 82.0 1st Qu.: 529 1st Qu.: 9.00 1st Qu.: 6.00
## Median :129.0 Median : 888 Median :13.00 Median :29.00
## Mean :152.8 Mean :1062 Mean :13.14 Mean :25.86
## 3rd Qu.:194.0 3rd Qu.:1400 3rd Qu.:17.00 3rd Qu.:42.00
## Max. :695.0 Max. :4983 Max. :23.00 Max. :59.00
## NA's :926
## time_hour year.y
## Min. :2013-01-01 05:00:00.00 Min. :1956
## 1st Qu.:2013-04-05 16:00:00.00 1st Qu.:1999
## Median :2013-07-04 09:00:00.00 Median :2002
## Mean :2013-07-03 16:34:46.05 Mean :2001
## 3rd Qu.:2013-10-01 12:00:00.00 3rd Qu.:2006
## Max. :2013-12-30 18:00:00.00 Max. :2013
## NA's :5054
## type manufacturer
## Fixed wing multi engine :270051 BOEING :79441
## Fixed wing single engine: 1552 EMBRAER :63272
## Rotorcraft : 377 AIRBUS :43742
## AIRBUS INDUSTRIE :40039
## BOMBARDIER INC :27416
## MCDONNELL DOUGLAS AIRCRAFT CO: 8766
## (Other) : 9304
## model engines seats speed
## A320-232 : 41865 Min. :1.000 Min. : 2.0 Min. : 90.0
## EMB-145LR : 26431 1st Qu.:2.000 1st Qu.: 55.0 1st Qu.:105.0
## ERJ 190-100 IGW: 23280 Median :2.000 Median :149.0 Median :126.0
## 737-824 : 13320 Mean :1.994 Mean :136.4 Mean :152.8
## EMB-145XR : 13299 3rd Qu.:2.000 3rd Qu.:189.0 3rd Qu.:127.0
## CL-600-2D24 : 11666 Max. :4.000 Max. :450.0 Max. :432.0
## (Other) :142119 NA's :271073
## engine temp dewp humid
## 4 Cycle : 37 Min. : 10.94 Min. :-9.94 Min. : 12.74
## Reciprocating: 1648 1st Qu.: 42.08 1st Qu.:26.06 1st Qu.: 43.75
## Turbo-fan :230495 Median : 57.92 Median :42.98 Median : 57.33
## Turbo-jet : 39380 Mean : 57.11 Mean :41.62 Mean : 59.29
## Turbo-prop : 43 3rd Qu.: 71.96 3rd Qu.:57.92 3rd Qu.: 74.80
## Turbo-shaft : 377 Max. :100.04 Max. :78.08 Max. :100.00
## NA's :15 NA's :15 NA's :15
## wind_dir wind_speed wind_gust precip
## W : 28945 Min. : 0.000 Min. :16.11 Min. :0.000000
## S : 28309 1st Qu.: 6.905 1st Qu.:20.71 1st Qu.:0.000000
## NW : 26301 Median :10.357 Median :24.17 Median :0.000000
## SW : 20188 Mean :10.993 Mean :25.06 Mean :0.004304
## WNW : 18852 3rd Qu.:13.809 3rd Qu.:27.62 3rd Qu.:0.000000
## (Other):132408 Max. :42.579 Max. :66.75 Max. :1.210000
## NA's : 16977 NA's :69 NA's :207178
## pressure visib name lat
## Min. : 983.8 Min. : 0.000 Length:271980 Min. :21.32
## 1st Qu.:1012.9 1st Qu.:10.000 Class :character 1st Qu.:32.73
## Median :1017.6 Median :10.000 Mode :character Median :36.08
## Mean :1017.9 Mean : 9.287 Mean :36.01
## 3rd Qu.:1022.9 3rd Qu.:10.000 3rd Qu.:41.41
## Max. :1042.1 Max. :10.000 Max. :61.17
## NA's :29023
## lon alt tz dst
## Min. :-157.92 Min. : 3.0 Min. :-10.0 Length:271980
## 1st Qu.: -95.34 1st Qu.: 26.0 1st Qu.: -6.0 Class :character
## Median : -84.22 Median : 313.0 Median : -5.0 Mode :character
## Mean : -90.12 Mean : 591.4 Mean : -5.8
## 3rd Qu.: -80.15 3rd Qu.: 748.0 3rd Qu.: -5.0
## Max. : -68.83 Max. :6602.0 Max. : -5.0
##
## tzone w_day week_num V2
## America/Anchorage : 6 Sun:37893 Min. : 0.00 LAX : 15368
## America/Chicago : 55657 Mon:40906 1st Qu.:13.00 ATL : 14320
## America/Denver : 9837 Tue:40352 Median :26.00 BOS : 13534
## America/Los_Angeles: 43201 Wed:40641 Mean :25.74 MCO : 13061
## America/New_York :158016 Thu:40400 3rd Qu.:39.00 SFO : 12639
## America/Phoenix : 4562 Fri:40416 Max. :51.00 CLT : 12064
## Pacific/Honolulu : 701 Sat:31372 (Other):190994
Suspected columns to have mostly NAs
length(which(is.na(flights_full$speed)))
## [1] 271073
length(which(is.na(flights_full$wind_gust)))
## [1] 207178
Since we would like to predict if a particular flight will be delayed, we logically selected a threshold indicating a significant delay on time - 20 minutes.
Flights that depart earlier than 10 minutes before their scheduled time (departure delay <= -10) were removed from the following analysis since they may confuse our model and add noise, as we aim to predict positive delay. Those flights with departure delay <= (-10) should be included in the analysis in case of prediction of negative delay.
Here we assume that a flight is considered as delayed only if its departure time was delayed by 20 minutes or more. A flight is considered as departed on time, i.e, no delay, if its departure time is in the range of 10 minutes earlier and until 20 minutes late (not including 20 minutes late).
#divide the flights into 2 groups according to their dep_delay - flights above 20 min delay, and flights above -10 & until 20 min delay
flights_full_new_dep_delay <-
flights_full[which(flights_full$dep_delay > -10),]
flights_full_arranged <-
flights_full_new_dep_delay %>% arrange(dep_delay)
# plot histogram of original dep_delay before changing it into 2 categories with the threshold
ggplot(flights_full_arranged, aes(x = dep_delay)) +
geom_histogram(color = "black", fill = "white", bins = 40) +
geom_vline(aes(xintercept = 20, color = "delay > 20 min"),
linetype = "dashed",
size = 1.3) +
scale_color_manual(name = "Treshold delay time", values = c("delay > 20 min" = "red")) +
labs(
x = "Departure delay time [min]",
y = "counts of flights",
title = paste('Histogram of departure delay time')
) +
theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))
# percentage of dep_delay=0
(length(which(flights_full_arranged$dep_delay < (20)))) / nrow(flights_full_arranged)
## [1] 0.7971997
# percentage of dep_delay=1
(length(which(flights_full_arranged$dep_delay >= (20)))) / nrow(flights_full_arranged)
## [1] 0.2028003
#change dep_delay column into categories (0 / 1)
flights_full_arranged <-
flights_full_arranged %>% mutate(dep_delay = case_when(dep_delay < 20 ~ 0,
dep_delay >= 20 ~ 1))
#convert dep_delay to factor column
flights_full_arranged$dep_delay <-
as.factor(flights_full_arranged$dep_delay)
#plot flights counts per 2 dep_delay categories - add legend
ggplot(flights_full_arranged, aes(dep_delay, fill = dep_delay)) + geom_bar(stat="count") +
scale_fill_manual(labels=c("0 = no delay", "1 = delay"), values = c('#CC6666', '#FFCCCC')) +
labs(title = "Flights counts per departure delay category", x = "Departure delay categories") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
Remove irrelevant columns
flights_full_arranged <-
select(
flights_full_arranged,-c(
time_hour,
arr_delay,
flight,
tailnum,
arr_time,
dep_time,
name,
lat,
lon,
alt,
tz,
dst,
speed,
wind_gust,
dewp,
hour.x,
minute,
day.x,
air_time
)
)
str(flights_full_arranged)
## Classes 'data.table' and 'data.frame': 262613 obs. of 26 variables:
## $ dest : chr "ABQ" "ABQ" "ABQ" "ABQ" ...
## $ month.x : Ord.factor w/ 12 levels "1"<"2"<"3"<"4"<..: 5 9 11 10 10 9 7 9 7 5 ...
## $ sched_dep_time: num 1201 1201 1200 1199 1199 ...
## $ dep_delay : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sched_arr_time: num 1388 1368 1383 1366 1366 ...
## $ carrier : Factor w/ 16 levels "9E","AA","AS",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ origin : Factor w/ 3 levels "EWR","JFK","LGA": 2 2 2 2 2 2 2 2 2 2 ...
## $ distance : num 1826 1826 1826 1826 1826 ...
## $ year.y : int 2004 2004 2011 2006 2004 2011 2005 2011 2006 2013 ...
## $ type : Factor w/ 3 levels "Fixed wing multi engine",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ manufacturer : Factor w/ 35 levels "AGUSTA SPA","AIRBUS",..: 2 2 2 2 2 18 18 18 18 18 ...
## $ model : Factor w/ 127 levels "150","172E","172M",..: 89 89 89 89 89 109 109 109 109 109 ...
## $ engines : int 2 2 2 2 2 2 2 2 2 2 ...
## $ seats : int 200 200 200 200 200 20 20 20 20 20 ...
## $ engine : Factor w/ 6 levels "4 Cycle","Reciprocating",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : num 57.9 63 52 60.1 53.1 ...
## $ humid : num 69.5 70.1 36.3 74.6 48.5 ...
## $ wind_dir : Factor w/ 16 levels "N","NNE","NE",..: 9 4 13 9 11 15 11 14 10 10 ...
## $ wind_speed : num 6.9 3.45 18.41 11.51 20.71 ...
## $ precip : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pressure : num 1021 1025 1017 1009 1014 ...
## $ visib : num 10 10 10 10 10 10 10 10 9 6 ...
## $ tzone : Factor w/ 7 levels "America/Anchorage",..: 3 3 3 3 3 5 5 5 5 5 ...
## $ w_day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 2 7 1 7 7 5 3 1 3 4 ...
## $ week_num : num 20 38 44 41 42 35 29 37 27 21 ...
## $ V2 : Factor w/ 100 levels "ABQ","ACK","ALB",..: 1 1 1 1 1 2 2 2 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
We wanted to check whether the variable ‘model’ is a unique variable. That is, there are no model names that appear under two different manufacturers. Our examination revealed a small number of models whose names are not unique and appear under two different manufacturers. To overcome this problem, we decided to combine ‘manufacturer’ and ‘model’ columns into one column named ‘manu_model’ with the aim of reducing the amount of variables in the dataset. The manu-model column contains unique names for each model because each model is coupled to its manufacturer’s name.
Later we checked whether this variable has priority over the manufacturer variable, and we saw that the variable is equivalent and, in some cases, even takes on higher importance in the models we run. Therefore, we will delete the ‘model’ and ‘manufacturer’ columns and leave the ‘manu-model’ column containing both information.
# show in how many different manufacturers each model presented
manufacturer_model_df<-flights_full_arranged %>% group_by(manufacturer, model) %>% summarise(counts=n())
table_manufacturer_model<-table(manufacturer_model_df$model)
table_manufacturer_model[table_manufacturer_model>1] # models' prevalence in different manufacturers
##
## A319-112 A319-114 A319-131 A319-132 A320-211 A320-212
## 2 2 2 2 2 2
## A320-214 A320-232 A321-211 A321-231 A330-223 CL-600-2B19
## 2 2 2 2 2 2
## FALCON-XP FALCON XP MD-88 MD-90-30
## 3 5 2 2
Combine manufacturer and model columns into one column
manu_model <-
paste(flights_full_arranged$manufacturer,
flights_full_arranged$model,
sep = "_")
flights_full_arranged$manu_model <- manu_model
In order to visually check whether a variable demonstrates variability in delayed flights (in cases when dep_delay=1), we created a bar plot for each variable separately while considering the number of flights in each category. Thus, for each category in a given variable we calculated the relative amount of delayed flights out of total flights in the current category. Due to this calculation, we can compare the categories.
flights_full_arranged_factors_all <-
data.frame(lapply(flights_full_arranged, factor))
str(flights_full_arranged_factors_all)
## 'data.frame': 262613 obs. of 27 variables:
## $ dest : Factor w/ 100 levels "ABQ","ACK","ALB",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ month.x : Ord.factor w/ 12 levels "1"<"2"<"3"<"4"<..: 5 9 11 10 10 9 7 9 7 5 ...
## $ sched_dep_time: Factor w/ 1011 levels "300","301","305",..: 871 871 870 869 869 389 399 389 150 370 ...
## $ dep_delay : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sched_arr_time: Factor w/ 1102 levels "1","2","3","4",..: 1051 1031 1046 1029 1029 442 459 442 212 430 ...
## $ carrier : Factor w/ 16 levels "9E","AA","AS",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ origin : Factor w/ 3 levels "EWR","JFK","LGA": 2 2 2 2 2 2 2 2 2 2 ...
## $ distance : Factor w/ 205 levels "80","94","96",..: 179 179 179 179 179 14 14 14 14 14 ...
## $ year.y : Factor w/ 46 levels "1956","1959",..: 37 37 44 39 37 44 38 44 39 46 ...
## $ type : Factor w/ 3 levels "Fixed wing multi engine",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ manufacturer : Factor w/ 35 levels "AGUSTA SPA","AIRBUS",..: 2 2 2 2 2 18 18 18 18 18 ...
## $ model : Factor w/ 127 levels "150","172E","172M",..: 89 89 89 89 89 109 109 109 109 109 ...
## $ engines : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 2 2 ...
## $ seats : Factor w/ 48 levels "2","4","5","6",..: 34 34 34 34 34 13 13 13 13 13 ...
## $ engine : Factor w/ 6 levels "4 Cycle","Reciprocating",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Factor w/ 167 levels "10.94","12.02",..: 101 112 82 105 85 132 143 122 132 120 ...
## $ humid : Factor w/ 2431 levels "12.74","13","13.95",..: 1807 1822 672 1938 1152 1141 1808 923 1988 2028 ...
## $ wind_dir : Factor w/ 16 levels "N","NNE","NE",..: 9 4 13 9 11 15 11 14 10 10 ...
## $ wind_speed : Factor w/ 34 levels "0","3.45234",..: 5 2 15 9 17 11 10 13 7 9 ...
## $ precip : Factor w/ 55 levels "0","0.01","0.02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pressure : Factor w/ 453 levels "983.8","985",..: 258 296 215 141 193 203 85 127 237 242 ...
## $ visib : Factor w/ 20 levels "0","0.06","0.12",..: 20 20 20 20 20 20 20 20 19 16 ...
## $ tzone : Factor w/ 7 levels "America/Anchorage",..: 3 3 3 3 3 5 5 5 5 5 ...
## $ w_day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 2 7 1 7 7 5 3 1 3 4 ...
## $ week_num : Factor w/ 52 levels "0","1","2","3",..: 21 39 45 42 43 36 30 38 28 22 ...
## $ V2 : Factor w/ 100 levels "ABQ","ACK","ALB",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ manu_model : Factor w/ 147 levels "AGUSTA SPA_A109E",..: 23 23 23 23 23 122 122 122 122 122 ...
gplot_var <- function(var_name, convert_x, palette_a = FALSE) {
count_df_var <-
flights_full_arranged_factors_all %>% group_by(!!(sym(var_name))) %>% summarize(total_count =
n())
df_var <-
flights_full_arranged_factors_all %>% group_by(!!(sym(var_name)), dep_delay) %>% tally() %>% filter(dep_delay ==
1)
df_var <- merge(df_var, count_df_var, by = var_name)
df_var <- df_var %>% mutate(relative_count_1 = n / total_count)
varX <- paste(var_name, ".x", sep = "")
colnames(df_var)[which(colnames(df_var) == varX)] <- var_name
g <-
ggplot(
data = df_var,
mapping = aes_string(x = var_name, y = "relative_count_1", fill = var_name)
) + geom_bar(stat = "identity", show.legend = FALSE) +
labs(
fill = var_name,
x = var_name,
y = 'Normalized count delayed flights',
title = paste("Normalized count delayed flights per category in", var_name)
)
if (convert_x == TRUE) {
t <- theme(
plot.title = element_text(
hjust = 0.5,
size = 19,
face = "bold"
),
axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1,
size = 6
)
)
} else {
t <- theme(
plot.title = element_text(
hjust = 0.5,
size = 19,
face = "bold"
),
axis.text.x = element_text(size = 6)
)
}
if (!(palette_a == FALSE)) {
color <- scale_fill_brewer(palette = palette_a)
g + t + color
} else {
g + t
}
}
Destination
# plot normalized delayed flights in destination variable
dest_plot<-gplot_var("dest", convert_x=TRUE)
dest_plot
Origin Airport
# plot normalized delayed flights in origin variable
origin_plot<-gplot_var("origin", convert_x=FALSE, palette_a="PuRd")
origin_plot
Month
# plot normalized delayed flights in month variable
colnames(flights_full_arranged_factors_all)[which(colnames(flights_full_arranged_factors_all)=="month.x")]<-"month"
month_plot<-gplot_var("month", convert_x=FALSE)
month_plot
Carrier (airline)
# plot normalized delayed flights in carrier variable
carrier_plot<-gplot_var("carrier", convert_x=FALSE)
carrier_plot
Year of Manufacturer
# plot normalized delayed flights in year variable
colnames(flights_full_arranged_factors_all)[which(colnames(flights_full_arranged_factors_all)=="year.y")]<-"year"
year_plot<-gplot_var("year", convert_x=TRUE)
year_plot
Engine Type
# plot normalized delayed flights in type variable
type_plot<-gplot_var("type", convert_x=FALSE, palette_a="Oranges")
type_plot
Manufacture
# plot normalized delayed flights in manufacturer variable
manufacturer_plot<-gplot_var("manufacturer", convert_x=TRUE)
manufacturer_plot
Model
# plot normalized delayed flights in model variable
model_plot<-gplot_var("model", convert_x=TRUE)
model_plot
Number of Engines
# plot normalized delayed flights in engines variable
engines_plot<-gplot_var("engines", convert_x=FALSE, palette_a="BuGn")
engines_plot
Visibility
# plot normalized delayed flights in visib variable
visib_plot<-gplot_var("visib", convert_x=FALSE)
visib_plot
Time Zone
# plot normalized delayed flights in tzone variable
tzone_plot<-gplot_var("tzone", convert_x=TRUE, palette_a="RdPu")
tzone_plot
Week Day
# plot normalized delayed flights in w_day variable
w_day_plot<-gplot_var("w_day", convert_x=FALSE, palette_a="YlGn")
w_day_plot
Week Number
# plot normalized delayed flights in week_num variable
week_num_plot<-gplot_var("week_num", convert_x=FALSE)
week_num_plot
Manu- Model
# plot normalized delayed flights in manu_model variable
manu_model_plot<-gplot_var("manu_model", convert_x=TRUE)
manu_model_plot
We have seen that overall there is not one particular variable that has an apparent direct relationship to the departure delay. The following plot aims to examine the option of multiple variables together having a more apparent relation to the departure delay.
wind_dir_df<- flights_full_arranged %>% group_by(origin, wind_dir) %>% summarize(total_count=n(), mean_speed = mean(wind_speed))
df_wind<-flights_full_arranged %>% group_by(wind_dir, dep_delay) %>% tally() %>% filter(dep_delay==1, !is.na(wind_dir))
df_wind<-merge(df_wind, wind_dir_df, by= 'wind_dir')
df_wind <- df_wind %>% mutate(relative_count_1=n/total_count)
ggplot(df_wind, aes(x = wind_dir, y = relative_count_1, fill = mean_speed)) +
geom_col(binwidth = 15,
boundary = -7.5,
position = 'stack') +
coord_polar(theta = "x", direction = 2) +
facet_wrap( ~ origin, nrow = 1) +
labs(x = "Wind Direction",
y = 'normalized delay (num delay/total flights)',
title = "Departure Delay and Wind Speed per Wind Direction and Origin",
fill = 'Ave. Wind Speed [mph]') +
theme(plot.title = element_text(face = "bold.italic", hjust = 0.5)) +
scale_fill_distiller(palette = 'YlOrRd', direction = 1)
In order to optimize our model, we decreased the number of categories (levels) in factorial variables that are identified with a large number of categories. This step was performed using a statistical permutation test on each variable separately. Our assumption is that in a particular variable, a category that belongs to a delayed flight, not by chance, should be considered individually in our model since it can indicate the possibility of delay of a future flight related to this category. However, a category that repeats on random realizations with the same or even larger prevalence of delayed flights will probably fail to reject the null hypothesis that delayed flights are not over-represented in the current category in the original dataset. Thus, the insignificant categories of a variable should be combined into one aggregated category.
We performed 2000 realizations of the original dataset: During each realization, we randomly shuffled the departure time delay (‘dep_delay’) column and counted how many flights labeled as delayed (with dep_delay=1) for each category in each variable separately. Next, we compared it to the original count of departure delayed flights in the particular category. The P-value for a given category was calculated as the number of realizations when the count of departure delayed flights was equal to or even higher than the original delayed flights count in this category (plus one), divided by the total number of realizations. For example, a permutation analysis p-value of 1e-3 (in case of 1000 total realizations) means that the null hypothesis was rejected in all the realizations, suggesting that the specific category is significant in distinguishing delayed flights in the original dataset, using a confidence of 0.005. Finally, the permutation analysis p-value was corrected for multiple testing using the Benjamini-Hochberg procedure.
The P-value for a given category was calculated as the number of realizations when the count of departure delayed flights was equal to or higher than the original delayed flights count in this category (plus one), divided by the total number of realizations. For example, a permutation analysis p-value of 1e-3 (in case of 1000 total realizations) means that the null hypothesis was rejected in all the realizations, suggesting that the specific category is significant in distinguishing delayed flights in the original dataset, using a confidence of 0.005. Finally, the permutation analysis p-value was corrected for multiple testing using the Benjamini-Hochberg procedure.
We applied this step to several variables: * “model” variable, which included 127 categories, and after the permutation test had only 13 categories. * “manufacturer” variable, which initially had 35 categories and finally had 4. * “manu_model” included 147 categories at the beginning and 12 at the end of the permutation test. * “seats” included 48 categories at the beginning and 8 at the end of the permutation test. * “dest” variable included at first 100 categories, and after the permutation test had only 52 categories.
We provide in this project the complete output data frames observed from the permutation analysis. There are 5 data frames - each variable has its own data frame that includes the number of realizations (plus one) in which delayed flight counts were equal to or higher than the original counts for each category. We visualized the results for each variable in bar plots representing p-value adjusted values (after Benjamini-Hochberg correction) of each category. A line representing FDR=0.05 is presented on the plot as well in order to emphasize cases (categories) when the null hypothesis was rejected. In addition, we generated plots of total flights in each category to compare p-value results to the total amount of flights in each category.
# permutation function
levels_model <-
levels(factor(flights_full_arranged$model)) #levels of model
levels_manufacturer <-
levels(factor(flights_full_arranged$manufacturer)) #levels of manufacturer
levels_manu_model <-
levels(factor(flights_full_arranged$manu_model)) #levels of manu_model
levels_dest <-
levels(factor(flights_full_arranged$dest)) #levels of destinations
levels_seats <-
levels(factor(flights_full_arranged$seats)) #levels of seats
run_permutations = FALSE
if (run_permutations) {
origin_num_delay_var <- function(var_name, levels_var) {
origin_delay_vec <-
sapply(
levels_var,
simplify = TRUE,
FUN = function(one_level) {
df_summarize <-
flights_full_arranged %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
FALSE) %>% tally
number_delay_in_level <-
df_summarize$n[df_summarize$dep_delay == 1]
number_delay_in_level
}
)
origin_delay_vec
}
origin_num_delay_model <-
origin_num_delay_var('model', levels_model)
origin_num_delay_manufacturer <-
origin_num_delay_var('manufacturer', levels_manufacturer)
origin_num_delay_manu_model <-
origin_num_delay_var('manu_model', levels_manu_model)
origin_num_delay_dest <- origin_num_delay_var('dest', levels_dest)
origin_num_delay_seats <-
origin_num_delay_var('seats', levels_seats)
perm_vec_var <- function(levels_var_model,
origin_num_delay_var) {
perm_vec <- rep(1, length(levels_var_model))
perm_vec <- setNames(perm_vec, names(origin_num_delay_var))
}
perm_ndelay_vec_model <-
perm_vec_var(levels_model, origin_num_delay_model)
perm_ndelay_vec_manufacturer <-
perm_vec_var(levels_manufacturer, origin_num_delay_manufacturer)
perm_ndelay_vec_manu_model <-
perm_vec_var(levels_manu_model, origin_num_delay_manu_model)
perm_ndelay_vec_dest <-
perm_vec_var(levels_dest, origin_num_delay_dest)
perm_ndelay_vec_seats <-
perm_vec_var(levels_seats, origin_num_delay_seats)
create_perm_ndelay_vec_var <-
function(flights_full_perm,
var_name,
levels_var,
origin_num_delay_var,
perm_vec_var) {
perm_num_delay_var <-
sapply(
levels_var,
simplify = TRUE,
FUN = function(one_level) {
df_summarize <-
flights_full_perm %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
FALSE) %>% tally
number_delay_in_level <-
df_summarize$n[df_summarize$dep_delay == 1]
number_delay_in_level
}
)
ind_greater <- which(perm_num_delay_var >= origin_num_delay_var)
perm_vec_var[ind_greater] <- perm_vec_var[ind_greater] + 1
perm_vec_var
}
num_perm <- 2000
for (iter in 1:num_perm) {
print(iter)
flights_full_perm <-
transform(flights_full_arranged, dep_delay = sample(dep_delay))
perm_ndelay_vec_model <-
create_perm_ndelay_vec_var(
flights_full_perm,
'model',
levels_model,
origin_num_delay_model,
perm_ndelay_vec_model
)
perm_ndelay_vec_manufacturer <-
create_perm_ndelay_vec_var(
flights_full_perm,
'manufacturer',
levels_manufacturer,
origin_num_delay_manufacturer,
perm_ndelay_vec_manufacturer
)
perm_ndelay_vec_manu_model <-
create_perm_ndelay_vec_var(
flights_full_perm,
'manu_model',
levels_manu_model,
origin_num_delay_manu_model,
perm_ndelay_vec_manu_model
)
perm_ndelay_vec_dest <-
create_perm_ndelay_vec_var(
flights_full_perm,
'dest',
levels_dest,
origin_num_delay_dest,
perm_ndelay_vec_dest
)
perm_ndelay_vec_seats <-
create_perm_ndelay_vec_var(
flights_full_perm,
'seats',
levels_seats,
origin_num_delay_seats,
perm_ndelay_vec_seats
)
}
#save perm_ndelay_vec_var output from permutations to csv (as tibble data frmae)
perm_ndelay_vec_model_df = tibble(name = names(perm_ndelay_vec_model), value = perm_ndelay_vec_model)
write.table(
perm_ndelay_vec_model_df ,
file = "../output/perm_ndelay_vec_model_df.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_manufacturer_df = tibble(name = names(perm_ndelay_vec_manufacturer),
value = perm_ndelay_vec_manufacturer)
write.table(
perm_ndelay_vec_manufacturer_df ,
file = "../output/perm_ndelay_vec_manufacturer_df.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_manu_model_df = tibble(name = names(perm_ndelay_vec_manu_model),
value = perm_ndelay_vec_manu_model)
write.table(
perm_ndelay_vec_manu_model_df ,
file = "../output/perm_ndelay_vec_manu_model_df.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_dest_df = tibble(name = names(perm_ndelay_vec_dest), value = perm_ndelay_vec_dest)
write.table(
perm_ndelay_vec_dest_df ,
file = "../output/perm_ndelay_vec_dest_df.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_seats_df = tibble(name = names(perm_ndelay_vec_seats), value = perm_ndelay_vec_seats)
write.table(
perm_ndelay_vec_seats_df ,
file = "../output/perm_ndelay_vec_seats_df.csv",
sep = ",",
row.names = FALSE
)
}
Load output data frames of permutation test and display plots of p-val and of flights count for each level
num_perm <- 2000
perm_result_var_df <- function(var_name) {
perm_ndelay_var_df <-
read.table(
file = paste(
"../output/perm_ndelay_merged_vec_",
var_name,
"_df.csv",
sep = ""
),
sep = ",",
header = TRUE
)
colnames(perm_ndelay_var_df)[1] <- var_name
perm_ndelay_var_df$p_val <- perm_ndelay_var_df$value / num_perm
perm_ndelay_var_df$p_adj <-
p.adjust(perm_ndelay_var_df$p_val, method = "BH")
flights_counts_var_df <-
flights_full_arranged %>% group_by((!!sym(var_name))) %>% summarise(total_counts =
n())
perm_ndelay_var_df <-
merge(perm_ndelay_var_df, flights_counts_var_df, by = var_name) #add flights counts per level
perm_ndelay_var_df <- perm_ndelay_var_df %>% arrange(p_adj)
p_val_plot <-
ggplot(perm_ndelay_var_df, aes(x = reorder((!!sym(
var_name
)), p_adj) , y = p_adj)) + geom_bar(stat = "identity") +
labs(
title = paste("p-value adjusted per", var_name, "level"),
x = var_name,
y = "p-value adjusted"
) +
theme(
plot.title = element_text(
hjust = 0.5,
size = 19,
face = "bold"
),
axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1,
size = 6
)
) +
geom_hline(aes(yintercept = 0.05, color = "0.05"),
linetype = "dashed",
size = 1.5) +
scale_color_manual(name = "FDR cutoff", values = c("0.05" = "red"))
flights_counts_plot <-
ggplot(perm_ndelay_var_df, aes(x = reorder((!!sym(
var_name
)), p_adj) , y = total_counts)) +
geom_bar(stat = "identity") + labs(
title = paste("flights counts per", var_name, "level"),
x = var_name,
y = "flights counts"
) +
theme(
plot.title = element_text(
hjust = 0.5,
size = 19,
face = "bold"
),
axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1,
size = 6
)
)
return(list(perm_ndelay_var_df, p_val_plot, flights_counts_plot))
}
Permutation - model
perm_model_df <- perm_result_var_df("model")
perm_ndelay_model_df <- perm_model_df[[1]]
perm_model_df[[2]]
perm_model_df[[3]]
Permutation - manufacturer
perm_manufacturer_df <- perm_result_var_df("manufacturer")
perm_ndelay_manufacturer_df <- perm_manufacturer_df[[1]]
perm_manufacturer_df[[2]]
perm_manufacturer_df[[3]]
Permutation - manu_model
perm_manu_model_df <- perm_result_var_df("manu_model")
perm_ndelay_manu_model_df <- perm_manu_model_df[[1]]
perm_manu_model_df[[2]]
perm_manu_model_df[[3]]
Permutation - dest
perm_dest_df <- perm_result_var_df("dest")
perm_ndelay_dest_df <- perm_dest_df[[1]]
perm_dest_df[[2]]
perm_dest_df[[3]]
Permutation - seats
perm_seats_df <- perm_result_var_df("seats")
perm_ndelay_seats_df <- perm_seats_df[[1]]
perm_seats_df[[2]]
perm_seats_df[[3]]
Following the permutation test and its visualization, in each variable column (5 variables), we replaced the original levels (categories) with new levels. Those who passed the Benjamini-Hochberg multiple test correction (FDR<0.05) were kept, while the insignificant categories were converted to a single new category named ‘other’. Categories in ‘model’, ‘manufacturer’, and ‘manu_model’ variables were changed to one-letter levels, as well as ‘dest’ variable, whose categories were replaced with numbers for visualization improvement purposes.
Change model levels in flights_full_arranged
flights_full_arranged$model <-
as.character(flights_full_arranged$model)
models_to_change <-
perm_ndelay_model_df$model[which(perm_ndelay_model_df$p_adj <= 0.05)] #filtered levels (were kept)
models_to_change
## [1] "717-200" "737-7H4" "757-324" "CL-600-2B19" "CL-600-2C10"
## [6] "CL-600-2D24" "EMB-145" "EMB-145LR" "EMB-145XR" "A319-115"
## [11] "777-224" "737-7BD"
flights_full_arranged <-
flights_full_arranged %>% mutate(model = replace(
model,!(flights_full_arranged$model %in% models_to_change),
'other'
))
flights_full_arranged$model <-
as.factor(flights_full_arranged$model) #convert back to factor variable
levels(flights_full_arranged$model)
## [1] "717-200" "737-7BD" "737-7H4" "757-324" "777-224"
## [6] "A319-115" "CL-600-2B19" "CL-600-2C10" "CL-600-2D24" "EMB-145"
## [11] "EMB-145LR" "EMB-145XR" "other"
Change model levels into one-letter categories
model_levels_after_change <-
data.frame(model = levels(flights_full_arranged$model))
model_levels_after_change$new_name <-
letters[1:nrow(model_levels_after_change)]
model_levels_after_change$new_name[model_levels_after_change$model == "other"] <-
"other"
model_letter_cat <- lapply(flights_full_arranged$model, function(row) {
y <-
model_levels_after_change$new_name[which(row == model_levels_after_change$model)]
y
})
flights_full_arranged$model <- as.factor(unlist(model_letter_cat))
levels(flights_full_arranged$model)
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i"
## [10] "j" "k" "l" "other"
Change manufacturer levels in flights_full_arranged
flights_full_arranged$manufacturer <-
as.character(flights_full_arranged$manufacturer)
manufacturers_to_change <-
perm_ndelay_manufacturer_df$manufacturer[which(perm_ndelay_manufacturer_df$p_adj <=
0.05)] #filtered levels (were kept)
manufacturers_to_change
## [1] "BOMBARDIER INC" "CANADAIR" "EMBRAER"
flights_full_arranged <-
flights_full_arranged %>% mutate(manufacturer = replace(
manufacturer,!(
flights_full_arranged$manufacturer %in% manufacturers_to_change
),
'other'
))
flights_full_arranged$manufacturer <-
as.factor(flights_full_arranged$manufacturer) #convert back to factor variable
levels(flights_full_arranged$manufacturer)
## [1] "BOMBARDIER INC" "CANADAIR" "EMBRAER" "other"
Change manufacturer levels into one-letter categories
manufacturer_levels_after_change <-
data.frame(manufacturer = levels(flights_full_arranged$manufacturer))
manufacturer_levels_after_change$new_name <-
letters[1:nrow(manufacturer_levels_after_change)]
manufacturer_levels_after_change$new_name[manufacturer_levels_after_change$manufacturer == "other"] <-
"other"
manufacturer_letter_cat <- lapply(flights_full_arranged$manufacturer, function(row) {
y <-
manufacturer_levels_after_change$new_name[which(row == manufacturer_levels_after_change$manufacturer)]
y
})
flights_full_arranged$manufacturer <- as.factor(unlist(manufacturer_letter_cat))
levels(flights_full_arranged$manufacturer)
## [1] "a" "b" "c" "other"
Change manu_model levels in flights_full_arranged
flights_full_arranged$manu_model <-
as.character(flights_full_arranged$manu_model)
manu_models_to_change <-
perm_ndelay_manu_model_df$manu_model[which(perm_ndelay_manu_model_df$p_adj <=
0.05)] #filtered levels (were kept)
manu_models_to_change
## [1] "BOEING_717-200" "BOEING_737-7H4"
## [3] "BOEING_757-324" "BOMBARDIER INC_CL-600-2B19"
## [5] "BOMBARDIER INC_CL-600-2C10" "BOMBARDIER INC_CL-600-2D24"
## [7] "CANADAIR_CL-600-2B19" "EMBRAER_EMB-145"
## [9] "EMBRAER_EMB-145LR" "EMBRAER_EMB-145XR"
## [11] "AIRBUS_A319-115" "BOEING_777-224"
flights_full_arranged <-
flights_full_arranged %>% mutate(manu_model = replace(
manu_model,!(flights_full_arranged$manu_model %in% manu_models_to_change),
'other'
))
flights_full_arranged$manu_model <-
as.factor(flights_full_arranged$manu_model) #convert back to factor variable
levels(flights_full_arranged$manu_model)
## [1] "AIRBUS_A319-115" "BOEING_717-200"
## [3] "BOEING_737-7H4" "BOEING_757-324"
## [5] "BOEING_777-224" "BOMBARDIER INC_CL-600-2B19"
## [7] "BOMBARDIER INC_CL-600-2C10" "BOMBARDIER INC_CL-600-2D24"
## [9] "CANADAIR_CL-600-2B19" "EMBRAER_EMB-145"
## [11] "EMBRAER_EMB-145LR" "EMBRAER_EMB-145XR"
## [13] "other"
Change manu_model column into one-letter categories
flights_full_arranged_bf_change <- flights_full_arranged
manu_model_levels_after_change <-
data.frame(manu_model = levels(flights_full_arranged$manu_model))
manu_model_levels_after_change$new_name <-
letters[1:nrow(manu_model_levels_after_change)]
manu_model_levels_after_change$new_name[manu_model_levels_after_change$manu_model ==
"other"] <-
"other"
manu_model_letter_cat <-
lapply(flights_full_arranged$manu_model, function(row) {
y <-
manu_model_levels_after_change$new_name[which(row == manu_model_levels_after_change$manu_model)]
y
})
flights_full_arranged$manu_model <-
as.factor(unlist(manu_model_letter_cat))
levels(flights_full_arranged$manu_model)
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i"
## [10] "j" "k" "l" "other"
Change destination levels in flights_full_arranged
flights_full_arranged$dest <-
as.character(flights_full_arranged$dest)
dests_to_change <-
perm_ndelay_dest_df$dest[which(perm_ndelay_dest_df$p_adj <= 0.05)] #filtered levels (were kept)
dests_to_change
## [1] "ALB" "BGR" "BHM" "BNA" "BTV" "BWI" "CAE" "CAK" "CLE" "CVG" "DAY" "DSM"
## [13] "GRR" "GSO" "GSP" "IAD" "ILM" "JAX" "MCI" "MDW" "MEM" "MHT" "MKE" "MSN"
## [25] "OKC" "OMA" "ORF" "PVD" "PWM" "RDU" "RIC" "ROC" "SAT" "SDF" "STL" "TUL"
## [37] "TYS" "BDL" "CHS" "SMF" "SAV" "PDX" "JAC" "SYR" "TVC" "ABQ" "PIT" "CHO"
## [49] "IND" "DEN" "BUF"
flights_full_arranged <-
flights_full_arranged %>% mutate(dest = replace(
dest,!(flights_full_arranged$dest %in% dests_to_change),
'other'
))
flights_full_arranged$dest <-
as.factor(flights_full_arranged$dest) #convert back to factor variable
levels(flights_full_arranged$dest)
## [1] "ABQ" "ALB" "BDL" "BGR" "BHM" "BNA" "BTV" "BUF" "BWI"
## [10] "CAE" "CAK" "CHO" "CHS" "CLE" "CVG" "DAY" "DEN" "DSM"
## [19] "GRR" "GSO" "GSP" "IAD" "ILM" "IND" "JAC" "JAX" "MCI"
## [28] "MDW" "MEM" "MHT" "MKE" "MSN" "OKC" "OMA" "ORF" "other"
## [37] "PDX" "PIT" "PVD" "PWM" "RDU" "RIC" "ROC" "SAT" "SAV"
## [46] "SDF" "SMF" "STL" "SYR" "TUL" "TVC" "TYS"
Change destination column into number categories
dest_levels_after_change <-
data.frame(dest = levels(flights_full_arranged$dest))
dest_levels_after_change$new_name <-
1:nrow(dest_levels_after_change)
dest_number_cat <-
lapply(flights_full_arranged$dest, function(row) {
y <- which(row == dest_levels_after_change$dest)
y
})
flights_full_arranged$dest <- as.factor(unlist(dest_number_cat))
levels(flights_full_arranged$dest)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
## [46] "46" "47" "48" "49" "50" "51" "52"
Change seats levels in flights_full_arranged
flights_full_arranged$seats <-
as.character(flights_full_arranged$seats)
seats_to_change <-
perm_ndelay_seats_df$seats[which(perm_ndelay_seats_df$p_adj <= 0.05)] #filtered levels (were kept)
seats_to_change
## [1] 55 80 95 100 140 147 275
flights_full_arranged <-
flights_full_arranged %>% mutate(seats = replace(
seats,!(flights_full_arranged$seats %in% seats_to_change),
'other'
))
flights_full_arranged$seats <-
as.factor(flights_full_arranged$seats) #convert back to factor variable
levels(flights_full_arranged$seats)
## [1] "100" "140" "147" "275" "55" "80" "95" "other"
In order to check the sensitivity of our prediction models to the threshold we chose to distinguish between delayed and non-delayed flights, we will generate two additional datasets, one with a higher threshold - 25 minutes, and one with a lower threshold - 15 minutes. Some of the pre-processing depends on the labels assigned to each flight (0- non-delay, 1- delay). Therefore, we will perform the necessary pre-processing steps to ensure that the data is coherent.
Threshold = 15 minutes
#new data frame for sensitivity check, threshold = 15
flights_full_15 <- flights_full
#divide the flights into 2 groups according to their dep_delay - flights above 15 min delay, and flights above -10 & until 15 min delay
flights_full_new_dep_delay_15 <-
flights_full_15[which(flights_full_15$dep_delay > -10), ]
flights_full_arranged_15 <-
flights_full_new_dep_delay_15 %>% arrange(dep_delay)
# plot histogram of original dep_delay before changing it into 2 categories with the threshold
ggplot(flights_full_arranged_15, aes(x = dep_delay)) +
geom_histogram(color = "black", fill = "white", bins = 40) +
geom_vline(aes(xintercept = 15, color = "delay > 15 min"),
linetype = "dashed",
size = 1.3) +
scale_color_manual(name = "Treshold delay time", values = c("delay > 15 min" = "red")) +
labs(
x = "Departure delay time [min]",
y = "counts of flights",
title = paste('Histogram of departure delay time')
) +
theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))
# percentage of dep_delay=0
(length(which(
flights_full_arranged_15$dep_delay < (15)
))) / nrow(flights_full_arranged_15)
## [1] 0.7662873
# percentage of dep_delay=1
(length(which(
flights_full_arranged_15$dep_delay >= (15)
))) / nrow(flights_full_arranged_15)
## [1] 0.2337127
#change dep_delay column into categories (0 / 1)
flights_full_arranged_15 <-
flights_full_arranged_15 %>% mutate(dep_delay = case_when(dep_delay < 15 ~ 0,
dep_delay >= 15 ~ 1))
#convert dep_delay to factor column
flights_full_arranged_15$dep_delay <-
as.factor(flights_full_arranged_15$dep_delay)
#plot flights counts per 2 dep_delay categories
ggplot(flights_full_arranged_15, aes(dep_delay, fill = dep_delay)) + geom_bar(stat="count") +
scale_fill_manual(labels=c("0 = no delay", "1 = delay"), values = c('#CC6666', '#FFCCCC')) +
labs(title = "Flights counts per departure delay category (15)", x = "Departure delay categories") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# merge manufacturer and model columns into one column
manu_model <-
paste(flights_full_arranged_15$manufacturer,
flights_full_arranged_15$model,
sep = "_")
flights_full_arranged_15$manu_model <- manu_model
# permutation function - skip it because it is already done
levels_model <-
levels(factor(flights_full_arranged_15$model)) #levels of model
levels_manufacturer <-
levels(factor(flights_full_arranged_15$manufacturer)) #levels of manufacturer
levels_manu_model <-
levels(factor(flights_full_arranged_15$manu_model)) #levels of manu_model
levels_dest <-
levels(factor(flights_full_arranged_15$dest)) #levels of destinations
levels_seats <-
levels(factor(flights_full_arranged_15$seats)) #levels of seats
run_permutations = FALSE
if (run_permutations) {
origin_num_delay_var <- function(var_name, levels_var) {
origin_delay_vec <-
sapply(
levels_var,
simplify = TRUE,
FUN = function(one_level) {
df_summarize <-
flights_full_arranged_15 %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
FALSE) %>% tally
number_delay_in_level <-
df_summarize$n[df_summarize$dep_delay == 1]
number_delay_in_level
}
)
origin_delay_vec
}
origin_num_delay_model <-
origin_num_delay_var('model', levels_model)
origin_num_delay_manufacturer <-
origin_num_delay_var('manufacturer', levels_manufacturer)
origin_num_delay_manu_model <-
origin_num_delay_var('manu_model', levels_manu_model)
origin_num_delay_dest <- origin_num_delay_var('dest', levels_dest)
origin_num_delay_seats <-
origin_num_delay_var('seats', levels_seats)
perm_vec_var <- function(levels_var_model,
origin_num_delay_var) {
perm_vec <- rep(1, length(levels_var_model))
perm_vec <- setNames(perm_vec, names(origin_num_delay_var))
}
perm_ndelay_vec_model <-
perm_vec_var(levels_model, origin_num_delay_model)
perm_ndelay_vec_manufacturer <-
perm_vec_var(levels_manufacturer, origin_num_delay_manufacturer)
perm_ndelay_vec_manu_model <-
perm_vec_var(levels_manu_model, origin_num_delay_manu_model)
perm_ndelay_vec_dest <-
perm_vec_var(levels_dest, origin_num_delay_dest)
perm_ndelay_vec_seats <-
perm_vec_var(levels_seats, origin_num_delay_seats)
create_perm_ndelay_vec_var <-
function(flights_full_perm,
var_name,
levels_var,
origin_num_delay_var,
perm_vec_var) {
perm_num_delay_var <-
sapply(
levels_var,
simplify = TRUE,
FUN = function(one_level) {
df_summarize <-
flights_full_perm %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
FALSE) %>% tally
number_delay_in_level <-
df_summarize$n[df_summarize$dep_delay == 1]
number_delay_in_level
}
)
ind_greater <- which(perm_num_delay_var >= origin_num_delay_var)
perm_vec_var[ind_greater] <- perm_vec_var[ind_greater] + 1
perm_vec_var
}
num_perm <- 1000
for (iter in 1:num_perm) {
print(iter)
flights_full_perm <-
transform(flights_full_arranged_15, dep_delay = sample(dep_delay))
perm_ndelay_vec_model <-
create_perm_ndelay_vec_var(
flights_full_perm,
'model',
levels_model,
origin_num_delay_model,
perm_ndelay_vec_model
)
perm_ndelay_vec_manufacturer <-
create_perm_ndelay_vec_var(
flights_full_perm,
'manufacturer',
levels_manufacturer,
origin_num_delay_manufacturer,
perm_ndelay_vec_manufacturer
)
perm_ndelay_vec_manu_model <-
create_perm_ndelay_vec_var(
flights_full_perm,
'manu_model',
levels_manu_model,
origin_num_delay_manu_model,
perm_ndelay_vec_manu_model
)
perm_ndelay_vec_dest <-
create_perm_ndelay_vec_var(
flights_full_perm,
'dest',
levels_dest,
origin_num_delay_dest,
perm_ndelay_vec_dest
)
perm_ndelay_vec_seats <-
create_perm_ndelay_vec_var(
flights_full_perm,
'seats',
levels_seats,
origin_num_delay_seats,
perm_ndelay_vec_seats
)
}
#save perm_ndelay_vec_var output from permutations to csv (as tibble data frmae)
perm_ndelay_vec_model_df = tibble(name = names(perm_ndelay_vec_model), value = perm_ndelay_vec_model)
write.table(
perm_ndelay_vec_model_df ,
file = "../output/perm_ndelay_merged_vec_model_df_15.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_manufacturer_df = tibble(name = names(perm_ndelay_vec_manufacturer),
value = perm_ndelay_vec_manufacturer)
write.table(
perm_ndelay_vec_manufacturer_df ,
file = "../output/perm_ndelay_merged_vec_manufacturer_df_15.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_manu_model_df = tibble(name = names(perm_ndelay_vec_manu_model),
value = perm_ndelay_vec_manu_model)
write.table(
perm_ndelay_vec_manu_model_df ,
file = "../output/final_project/perm_ndelay_merged_vec_manu_model_df_15.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_dest_df = tibble(name = names(perm_ndelay_vec_dest), value = perm_ndelay_vec_dest)
write.table(
perm_ndelay_vec_dest_df ,
file = "../output/perm_ndelay_merged_vec_dest_df_15.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_seats_df = tibble(name = names(perm_ndelay_vec_seats), value = perm_ndelay_vec_seats)
write.table(
perm_ndelay_vec_seats_df ,
file = "../output/perm_ndelay_merged_vec_seats_df_15.csv",
sep = ",",
row.names = FALSE
)
}
#display plots of p-val for each level and flights count for each level
num_perm <- 1000
perm_result_var_df <- function(var_name) {
perm_ndelay_var_df <-
read.table(
file = paste(
"../output/perm_ndelay_merged_vec_",
var_name,
"_df_15.csv",
sep = ""
),
sep = ",",
header = TRUE
)
colnames(perm_ndelay_var_df)[1] <- var_name
perm_ndelay_var_df$p_val <- perm_ndelay_var_df$value / num_perm
perm_ndelay_var_df$p_adj <-
p.adjust(perm_ndelay_var_df$p_val, method = "BH")
flights_counts_var_df <-
flights_full_arranged_15 %>% group_by((!!sym(var_name))) %>% summarise(total_counts =
n())
perm_ndelay_var_df <-
merge(perm_ndelay_var_df, flights_counts_var_df, by = var_name) #add flights counts per level
perm_ndelay_var_df <- perm_ndelay_var_df %>% arrange(p_adj)
p_val_plot <-
ggplot(perm_ndelay_var_df, aes(x = reorder((!!sym(
var_name
)), p_adj) , y = p_adj)) + geom_bar(stat = "identity") +
labs(
title = paste("p-value adjusted per", var_name, "level"),
x = var_name,
y = "p-value adjusted"
) +
theme(
plot.title = element_text(
hjust = 0.5,
size = 19,
face = "bold"
),
axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1,
size = 6
)
) +
geom_hline(aes(yintercept = 0.05, color = "0.05"),
linetype = "dashed",
size = 1.5) +
scale_color_manual(name = "FDR cutoff", values = c("0.05" = "red"))
flights_counts_plot <-
ggplot(perm_ndelay_var_df, aes(x = reorder((!!sym(
var_name
)), p_adj) , y = total_counts)) +
geom_bar(stat = "identity") + labs(
title = paste("flights counts per", var_name, "level"),
x = var_name,
y = "flights counts"
) +
theme(
plot.title = element_text(
hjust = 0.5,
size = 19,
face = "bold"
),
axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1,
size = 6
)
)
return(list(perm_ndelay_var_df, p_val_plot, flights_counts_plot))
}
#permutation - model
perm_model_df <- perm_result_var_df("model")
perm_ndelay_model_df <- perm_model_df[[1]]
#permutation - manufacturer
perm_manufacturer_df <- perm_result_var_df("manufacturer")
perm_ndelay_manufacturer_df <- perm_manufacturer_df[[1]]
#permutation - manu_model
perm_manu_model_df <- perm_result_var_df("manu_model")
perm_ndelay_manu_model_df <- perm_manu_model_df[[1]]
#permutation - dest
perm_dest_df <- perm_result_var_df("dest")
perm_ndelay_dest_df <- perm_dest_df[[1]]
#permutation - seats
perm_seats_df <- perm_result_var_df("seats")
perm_ndelay_seats_df <- perm_seats_df[[1]]
#change model levels in flights_full_arranged_15
flights_full_arranged_15$model <-
as.character(flights_full_arranged_15$model)
models_to_change <-
perm_ndelay_model_df$model[which(perm_ndelay_model_df$p_adj <= 0.05)] #filtered levels (were kept)
flights_full_arranged_15 <-
flights_full_arranged_15 %>% mutate(model = replace(
model,
!(flights_full_arranged_15$model %in% models_to_change),
'other_model'
))
flights_full_arranged_15$model <-
as.factor(flights_full_arranged_15$model) #convert back to factor variable
#change model levels into one-letter categories
flights_full_arranged_15_bf_change <- flights_full_arranged_15
model_levels_after_change <-
data.frame(model = levels(flights_full_arranged_15$model))
model_levels_after_change$new_name <-
letters[1:nrow(model_levels_after_change)]
model_levels_after_change$new_name[model_levels_after_change$model == "other_model"] <-
"other"
model_letter_cat <-
lapply(flights_full_arranged_15$model, function(row) {
y <-
model_levels_after_change$new_name[which(row == model_levels_after_change$model)]
y
})
flights_full_arranged_15$model <- as.factor(unlist(model_letter_cat))
#change manufacturer levels in flights_full_arranged_15
flights_full_arranged_15$manufacturer <-
as.character(flights_full_arranged_15$manufacturer)
manufacturers_to_change <-
perm_ndelay_manufacturer_df$manufacturer[which(perm_ndelay_manufacturer_df$p_adj <=
0.05)] #filtered levels (were kept)
flights_full_arranged_15 <-
flights_full_arranged_15 %>% mutate(manufacturer = replace(
manufacturer,
!(
flights_full_arranged_15$manufacturer %in% manufacturers_to_change
),
'other_manufacturer'
))
flights_full_arranged_15$manufacturer <-
as.factor(flights_full_arranged_15$manufacturer) #convert back to factor variable
#change manu_model levels in flights_full_arranged_15
flights_full_arranged_15$manu_model <-
as.character(flights_full_arranged_15$manu_model)
manu_models_to_change <-
perm_ndelay_manu_model_df$manu_model[which(perm_ndelay_manu_model_df$p_adj <=
0.05)] #filtered levels (were kept)
flights_full_arranged_15 <-
flights_full_arranged_15 %>% mutate(manu_model = replace(
manu_model,
!(
flights_full_arranged_15$manu_model %in% manu_models_to_change
),
'other_manu_model'
))
flights_full_arranged_15$manu_model <-
as.factor(flights_full_arranged_15$manu_model) #convert back to factor variable
#change manu_model column into one-letter categories
flights_full_arranged_15_bf_change <- flights_full_arranged_15
manu_model_levels_after_change <-
data.frame(manu_model = levels(flights_full_arranged_15$manu_model))
manu_model_levels_after_change$new_name <-
letters[1:nrow(manu_model_levels_after_change)]
manu_model_levels_after_change$new_name[manu_model_levels_after_change$manu_model ==
"other_manu_model"] <- "other"
manu_model_letter_cat <-
lapply(flights_full_arranged_15$manu_model, function(row) {
y <-
manu_model_levels_after_change$new_name[which(row == manu_model_levels_after_change$manu_model)]
y
})
flights_full_arranged_15$manu_model <-
as.factor(unlist(manu_model_letter_cat))
#change destination levels in flights_full_arranged_15
flights_full_arranged_15$dest <-
as.character(flights_full_arranged_15$dest)
dests_to_change <-
perm_ndelay_dest_df$dest[which(perm_ndelay_dest_df$p_adj <= 0.05)] #filtered levels (were kept)
flights_full_arranged_15 <-
flights_full_arranged_15 %>% mutate(dest = replace(
dest,
!(flights_full_arranged_15$dest %in% dests_to_change),
'other_dest'
))
flights_full_arranged_15$dest <-
as.factor(flights_full_arranged_15$dest) #convert back to factor variable
#change destination column into number categories
dest_levels_after_change <-
data.frame(dest = levels(flights_full_arranged_15$dest))
dest_levels_after_change$new_name <- 1:nrow(dest_levels_after_change)
dest_number_cat <-
lapply(flights_full_arranged_15$dest, function(row) {
y <- which(row == dest_levels_after_change$dest)
y
})
flights_full_arranged_15$dest <- as.factor(unlist(dest_number_cat))
#change seats levels in flights_full_arranged_15
flights_full_arranged_15$seats <-
as.character(flights_full_arranged_15$seats)
seats_to_change <-
perm_ndelay_seats_df$seats[which(perm_ndelay_seats_df$p_adj <= 0.05)] #filtered levels (were kept)
flights_full_arranged_15 <-
flights_full_arranged_15 %>% mutate(seats = replace(
seats,
!(flights_full_arranged_15$seats %in% seats_to_change),
'other_seats'
))
flights_full_arranged_15$seats <-
as.factor(flights_full_arranged_15$seats) #convert back to factor variable
Threshold = 25 minutes
flights_full_25 <- flights_full
#divide the flights into 2 groups according to their dep_delay - flights above 25 min delay, and flights above -10 & until 25 min delay
flights_full_new_dep_delay_25 <-
flights_full_25[which(flights_full_25$dep_delay > -10), ]
flights_full_arranged_25 <-
flights_full_new_dep_delay_25 %>% arrange(dep_delay)
# plot histogram of original dep_delay before changing it into 2 categories with the threshold
ggplot(flights_full_arranged_25, aes(x = dep_delay)) +
geom_histogram(color = "black", fill = "white", bins = 40) +
geom_vline(aes(xintercept = 25, color = "delay > 25 min"),
linetype = "dashed",
size = 1.3) +
scale_color_manual(name = "Treshold delay time", values = c("delay > 25 min" = "red")) +
labs(
x = "Departure delay time [min]",
y = "counts of flights",
title = paste('Histogram of departure delay time')
) +
theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))
# percentage of dep_delay=0
(length(which(
flights_full_arranged_25$dep_delay < (25)
))) / nrow(flights_full_arranged_25)
## [1] 0.8218672
# percentage of dep_delay=1
(length(which(
flights_full_arranged_25$dep_delay >= (25)
))) / nrow(flights_full_arranged_25)
## [1] 0.1781328
#change dep_delay column into categories (0 / 1)
flights_full_arranged_25 <-
flights_full_arranged_25 %>% mutate(dep_delay = case_when(dep_delay < 25 ~ 0,
dep_delay >= 25 ~ 1))
#convert dep_delay to factor column
flights_full_arranged_25$dep_delay <-
as.factor(flights_full_arranged_25$dep_delay)
#plot flights counts per 2 dep_delay categories
ggplot(flights_full_arranged_25, aes(dep_delay, fill = dep_delay)) + geom_bar(stat="count") +
scale_fill_manual(labels=c("0 = no delay", "1 = delay"), values = c('#CC6666', '#FFCCCC')) +
labs(title = "Flights counts per departure delay category (25)", x = "Departure delay categories") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# merge manufacturer and model columns into one column
manu_model <-
paste(flights_full_arranged_25$manufacturer,
flights_full_arranged_25$model,
sep = "_")
flights_full_arranged_25$manu_model <- manu_model
# permutation function - skip it because it is already done
levels_model <-
levels(factor(flights_full_arranged_25$model)) #levels of model
levels_manufacturer <-
levels(factor(flights_full_arranged_25$manufacturer)) #levels of manufacturer
levels_manu_model <-
levels(factor(flights_full_arranged_25$manu_model)) #levels of manu_model
levels_dest <-
levels(factor(flights_full_arranged_25$dest)) #levels of destinations
levels_seats <-
levels(factor(flights_full_arranged_25$seats)) #levels of seats
run_permutations = FALSE
if (run_permutations) {
origin_num_delay_var <- function(var_name, levels_var) {
origin_delay_vec <-
sapply(
levels_var,
simplify = TRUE,
FUN = function(one_level) {
df_summarize <-
flights_full_arranged_25 %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
FALSE) %>% tally
number_delay_in_level <-
df_summarize$n[df_summarize$dep_delay == 1]
number_delay_in_level
}
)
origin_delay_vec
}
origin_num_delay_model <-
origin_num_delay_var('model', levels_model)
origin_num_delay_manufacturer <-
origin_num_delay_var('manufacturer', levels_manufacturer)
origin_num_delay_manu_model <-
origin_num_delay_var('manu_model', levels_manu_model)
origin_num_delay_dest <- origin_num_delay_var('dest', levels_dest)
origin_num_delay_seats <-
origin_num_delay_var('seats', levels_seats)
perm_vec_var <- function(levels_var_model,
origin_num_delay_var) {
perm_vec <- rep(1, length(levels_var_model))
perm_vec <- setNames(perm_vec, names(origin_num_delay_var))
}
perm_ndelay_vec_model <-
perm_vec_var(levels_model, origin_num_delay_model)
perm_ndelay_vec_manufacturer <-
perm_vec_var(levels_manufacturer, origin_num_delay_manufacturer)
perm_ndelay_vec_manu_model <-
perm_vec_var(levels_manu_model, origin_num_delay_manu_model)
perm_ndelay_vec_dest <-
perm_vec_var(levels_dest, origin_num_delay_dest)
perm_ndelay_vec_seats <-
perm_vec_var(levels_seats, origin_num_delay_seats)
create_perm_ndelay_vec_var <-
function(flights_full_perm,
var_name,
levels_var,
origin_num_delay_var,
perm_vec_var) {
perm_num_delay_var <-
sapply(
levels_var,
simplify = TRUE,
FUN = function(one_level) {
df_summarize <-
flights_full_perm %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
FALSE) %>% tally
number_delay_in_level <-
df_summarize$n[df_summarize$dep_delay == 1]
number_delay_in_level
}
)
ind_greater <- which(perm_num_delay_var >= origin_num_delay_var)
perm_vec_var[ind_greater] <- perm_vec_var[ind_greater] + 1
perm_vec_var
}
num_perm <- 1000
for (iter in 1:num_perm) {
print(iter)
flights_full_perm <-
transform(flights_full_arranged_25, dep_delay = sample(dep_delay))
perm_ndelay_vec_model <-
create_perm_ndelay_vec_var(
flights_full_perm,
'model',
levels_model,
origin_num_delay_model,
perm_ndelay_vec_model
)
perm_ndelay_vec_manufacturer <-
create_perm_ndelay_vec_var(
flights_full_perm,
'manufacturer',
levels_manufacturer,
origin_num_delay_manufacturer,
perm_ndelay_vec_manufacturer
)
perm_ndelay_vec_manu_model <-
create_perm_ndelay_vec_var(
flights_full_perm,
'manu_model',
levels_manu_model,
origin_num_delay_manu_model,
perm_ndelay_vec_manu_model
)
perm_ndelay_vec_dest <-
create_perm_ndelay_vec_var(
flights_full_perm,
'dest',
levels_dest,
origin_num_delay_dest,
perm_ndelay_vec_dest
)
perm_ndelay_vec_seats <-
create_perm_ndelay_vec_var(
flights_full_perm,
'seats',
levels_seats,
origin_num_delay_seats,
perm_ndelay_vec_seats
)
}
#save perm_ndelay_vec_var output from permutations to csv (as tibble data frmae)
perm_ndelay_vec_model_df = tibble(name = names(perm_ndelay_vec_model), value = perm_ndelay_vec_model)
write.table(
perm_ndelay_vec_model_df ,
file = "../output/perm_ndelay_merged_vec_model_df_25.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_manufacturer_df = tibble(name = names(perm_ndelay_vec_manufacturer),
value = perm_ndelay_vec_manufacturer)
write.table(
perm_ndelay_vec_manufacturer_df ,
file = "../output/perm_ndelay_merged_vec_manufacturer_df_25.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_manu_model_df = tibble(name = names(perm_ndelay_vec_manu_model),
value = perm_ndelay_vec_manu_model)
write.table(
perm_ndelay_vec_manu_model_df ,
file = "../output/perm_ndelay_merged_vec_manu_model_df_25.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_dest_df = tibble(name = names(perm_ndelay_vec_dest), value = perm_ndelay_vec_dest)
write.table(
perm_ndelay_vec_dest_df ,
file = "../output/perm_ndelay_merged_vec_dest_df_25.csv",
sep = ",",
row.names = FALSE
)
perm_ndelay_vec_seats_df = tibble(name = names(perm_ndelay_vec_seats), value = perm_ndelay_vec_seats)
write.table(
perm_ndelay_vec_seats_df ,
file = "../output/perm_ndelay_merged_vec_seats_df_25.csv",
sep = ",",
row.names = FALSE
)
}
num_perm <- 1000
perm_result_var_df <- function(var_name) {
perm_ndelay_var_df <-
read.table(
file = paste(
"../output/perm_ndelay_merged_vec_",
var_name,
"_df_25.csv",
sep = ""
),
sep = ",",
header = TRUE
)
colnames(perm_ndelay_var_df)[1] <- var_name
perm_ndelay_var_df$p_val <- perm_ndelay_var_df$value / num_perm
perm_ndelay_var_df$p_adj <-
p.adjust(perm_ndelay_var_df$p_val, method = "BH")
flights_counts_var_df <-
flights_full_arranged_25 %>% group_by((!!sym(var_name))) %>% summarise(total_counts =
n())
perm_ndelay_var_df <-
merge(perm_ndelay_var_df, flights_counts_var_df, by = var_name) #add flights counts per level
perm_ndelay_var_df <- perm_ndelay_var_df %>% arrange(p_adj)
p_val_plot <-
ggplot(perm_ndelay_var_df, aes(x = reorder((!!sym(
var_name
)), p_adj) , y = p_adj)) + geom_bar(stat = "identity") +
labs(
title = paste("p-value adjusted per", var_name, "level"),
x = var_name,
y = "p-value adjusted"
) +
theme(
plot.title = element_text(
hjust = 0.5,
size = 19,
face = "bold"
),
axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1,
size = 6
)
) +
geom_hline(aes(yintercept = 0.05, color = "0.05"),
linetype = "dashed",
size = 1.5) +
scale_color_manual(name = "FDR cutoff", values = c("0.05" = "red"))
flights_counts_plot <-
ggplot(perm_ndelay_var_df, aes(x = reorder((!!sym(
var_name
)), p_adj) , y = total_counts)) +
geom_bar(stat = "identity") + labs(
title = paste("flights counts per", var_name, "level"),
x = var_name,
y = "flights counts"
) +
theme(
plot.title = element_text(
hjust = 0.5,
size = 19,
face = "bold"
),
axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1,
size = 6
)
)
return(list(perm_ndelay_var_df, p_val_plot, flights_counts_plot))
}
#permutation - model
perm_model_df <- perm_result_var_df("model")
perm_ndelay_model_df <- perm_model_df[[1]]
#permutation - manufacturer
perm_manufacturer_df <- perm_result_var_df("manufacturer")
perm_ndelay_manufacturer_df <- perm_manufacturer_df[[1]]
#permutation - manu_model
perm_manu_model_df <- perm_result_var_df("manu_model")
perm_ndelay_manu_model_df <- perm_manu_model_df[[1]]
#permutation - dest
perm_dest_df <- perm_result_var_df("dest")
perm_ndelay_dest_df <- perm_dest_df[[1]]
#permutation - seats
perm_seats_df <- perm_result_var_df("seats")
perm_ndelay_seats_df <- perm_seats_df[[1]]
#change model levels in flights_full_arranged_25
flights_full_arranged_25$model <-
as.character(flights_full_arranged_25$model)
models_to_change <-
perm_ndelay_model_df$model[which(perm_ndelay_model_df$p_adj <= 0.05)] #filtered levels (were kept)
flights_full_arranged_25 <-
flights_full_arranged_25 %>% mutate(model = replace(
model,
!(flights_full_arranged_25$model %in% models_to_change),
'other_model'
))
flights_full_arranged_25$model <-
as.factor(flights_full_arranged_25$model) #convert back to factor variable
#change model levels into one-letter categories
flights_full_arranged_25_bf_change <- flights_full_arranged_25
model_levels_after_change <-
data.frame(model = levels(flights_full_arranged_25$model))
model_levels_after_change$new_name <-
letters[1:nrow(model_levels_after_change)]
model_levels_after_change$new_name[model_levels_after_change$model == "other_model"] <-
"other"
model_letter_cat <-
lapply(flights_full_arranged_25$model, function(row) {
y <-
model_levels_after_change$new_name[which(row == model_levels_after_change$model)]
y
})
flights_full_arranged_25$model <- as.factor(unlist(model_letter_cat))
#change manufacturer levels in flights_full_arranged_25
flights_full_arranged_25$manufacturer <-
as.character(flights_full_arranged_25$manufacturer)
manufacturers_to_change <-
perm_ndelay_manufacturer_df$manufacturer[which(perm_ndelay_manufacturer_df$p_adj <=
0.05)] #filtered levels (were kept)
flights_full_arranged_25 <-
flights_full_arranged_25 %>% mutate(manufacturer = replace(
manufacturer,
!(
flights_full_arranged_25$manufacturer %in% manufacturers_to_change
),
'other_manufacturer'
))
flights_full_arranged_25$manufacturer <-
as.factor(flights_full_arranged_25$manufacturer) #convert back to factor variable
#change manu_model levels in flights_full_arranged_25
flights_full_arranged_25$manu_model <-
as.character(flights_full_arranged_25$manu_model)
manu_models_to_change <-
perm_ndelay_manu_model_df$manu_model[which(perm_ndelay_manu_model_df$p_adj <=
0.05)] #filtered levels (were kept)
flights_full_arranged_25 <-
flights_full_arranged_25 %>% mutate(manu_model = replace(
manu_model,
!(
flights_full_arranged_25$manu_model %in% manu_models_to_change
),
'other_manu_model'
))
flights_full_arranged_25$manu_model <-
as.factor(flights_full_arranged_25$manu_model) #convert back to factor variable
#change manu_model column into one-letter categories
flights_full_arranged_25_bf_change <- flights_full_arranged_25
manu_model_levels_after_change <-
data.frame(manu_model = levels(flights_full_arranged_25$manu_model))
manu_model_levels_after_change$new_name <-
letters[1:nrow(manu_model_levels_after_change)]
manu_model_levels_after_change$new_name[manu_model_levels_after_change$manu_model ==
"other_manu_model"] <- "other"
manu_model_letter_cat <-
lapply(flights_full_arranged_25$manu_model, function(row) {
y <-
manu_model_levels_after_change$new_name[which(row == manu_model_levels_after_change$manu_model)]
y
})
flights_full_arranged_25$manu_model <-
as.factor(unlist(manu_model_letter_cat))
#change destination levels in flights_full_arranged_25
flights_full_arranged_25$dest <-
as.character(flights_full_arranged_25$dest)
dests_to_change <-
perm_ndelay_dest_df$dest[which(perm_ndelay_dest_df$p_adj <= 0.05)] #filtered levels (were kept)
flights_full_arranged_25 <-
flights_full_arranged_25 %>% mutate(dest = replace(
dest,
!(flights_full_arranged_25$dest %in% dests_to_change),
'other_dest'
))
flights_full_arranged_25$dest <-
as.factor(flights_full_arranged_25$dest) #convert back to factor variable
#change destination column into number categories
dest_levels_after_change <-
data.frame(dest = levels(flights_full_arranged_25$dest))
dest_levels_after_change$new_name <- 1:nrow(dest_levels_after_change)
dest_number_cat <-
lapply(flights_full_arranged_25$dest, function(row) {
y <- which(row == dest_levels_after_change$dest)
y
})
flights_full_arranged_25$dest <- as.factor(unlist(dest_number_cat))
#change seats levels in flights_full_arranged_25
flights_full_arranged_25$seats <-
as.character(flights_full_arranged_25$seats)
seats_to_change <-
perm_ndelay_seats_df$seats[which(perm_ndelay_seats_df$p_adj <= 0.05)] #filtered levels (were kept)
flights_full_arranged_25 <-
flights_full_arranged_25 %>% mutate(seats = replace(
seats,
!(flights_full_arranged_25$seats %in% seats_to_change),
'other_seats'
))
flights_full_arranged_25$seats <-
as.factor(flights_full_arranged_25$seats) #convert back to factor variable
We decided to exclude ‘model’ and ‘manufacturer’ variables from our model analysis and instead we took into account the ‘manu_model’ variable. As we discussed above we combined models and manufacturers into one column represnting unique models according to their manufacturer. This step prevented future mistakenly grouping of planes with identical model name but different manufacturer. As we prefer to reduce number of predicting variables in our model to prevent confusion we removed ‘model’ and ‘manufacturer’, and let the model basing on ‘mau_model’ variable.
## [1] 25
## dest month.x sched_dep_time dep_delay sched_arr_time carrier origin distance
## 1: 1 5 1201 0 1388 B6 JFK 1826
## 2: 1 9 1201 0 1368 B6 JFK 1826
## 3: 1 11 1200 0 1383 B6 JFK 1826
## 4: 1 10 1199 0 1366 B6 JFK 1826
## 5: 1 10 1199 0 1366 B6 JFK 1826
## 6: 36 9 719 0 779 B6 JFK 199
## year.y type engines seats engine temp humid wind_dir
## 1: 2004 Fixed wing multi engine 2 other Turbo-fan 57.92 69.52 S
## 2: 2004 Fixed wing multi engine 2 other Turbo-fan 62.96 70.07 ENE
## 3: 2011 Fixed wing multi engine 2 other Turbo-fan 51.98 36.31 W
## 4: 2006 Fixed wing multi engine 2 other Turbo-fan 60.08 74.56 S
## 5: 2004 Fixed wing multi engine 2 other Turbo-fan 53.06 48.51 SW
## 6: 2011 Fixed wing multi engine 2 other Turbo-fan 77.00 48.19 NW
## wind_speed precip pressure visib tzone w_day week_num V2
## 1: 6.90468 0 1021.0 10 America/Denver Mon 20 ABQ
## 2: 3.45234 0 1024.8 10 America/Denver Sat 38 ABQ
## 3: 18.41248 0 1016.7 10 America/Denver Sun 44 ABQ
## 4: 11.50780 0 1009.3 10 America/Denver Sat 41 ABQ
## 5: 20.71404 0 1014.5 10 America/Denver Sat 42 ABQ
## 6: 13.80936 0 1015.5 10 America/New_York Thu 35 ACK
## manu_model
## 1: other
## 2: other
## 3: other
## 4: other
## 5: other
## 6: other
There is a need to balance the dataset because the amount of delays is ~20%. Imbalanced data can cause the model to be very good at learning the class that is the majority, but it will not learn to predict the minority class at the same level. There are mainly two ways to handle an imbalanced dataset. 1. Random oversampling- duplicates examples from the minority class in the training dataset and can result in overfitting for some models. 2. Random undersampling- delete examples from the majority class and can result in losing information invaluable to a model. We are using ovun.sample function from the ROSE package, which creates possibly balanced samples by a combination of random over-sampling minority examples and under-sampling majority examples.
set.seed(20)
get_balanced_df <- function(flights_full_arranged) {
balanced.data <-
ovun.sample(
dep_delay ~ .,
flights_full_arranged,
method = "both",
p = 0.5,
seed = 1996
)$data
return(balanced.data)
}
#balanced_data
balanced_data <- get_balanced_df(flights_full_arranged)
table(balanced_data$dep_delay)
##
## 0 1
## 107373 108299
And this is after balancing the data:
Next, split the balanced data to train and test (80/20)
train_test_split <- function(balanced_data) {
sample = sample.split(balanced_data$dep_delay, SplitRatio = .8)
train = subset(balanced_data, sample == TRUE)
test = subset(balanced_data, sample == FALSE)
train_test <- list(train, test)
return(train_test)
}
train <- train_test_split(balanced_data)[[1]]
test <- train_test_split(balanced_data)[[2]]
dim(train)
## [1] 172537 25
dim(test)
## [1] 43135 25
sensitivity check - creation of balanced data and splitting into train-test data sets:
#threshold = 15
balanced_data_15 <- get_balanced_df(flights_full_arranged_15) #balance the data
train_15 <- train_test_split(balanced_data_15)[[1]] #divide to train and test
test_15 <- train_test_split(balanced_data_15)[[2]]
#threshold = 25
balanced_data_25 <- get_balanced_df(flights_full_arranged_25) #balance the data
train_25 <- train_test_split(balanced_data_25)[[1]] #divide to train and test
test_25 <- train_test_split(balanced_data_25)[[2]]
Before running prediction model, we would like to choose best parameters for optimizing the model. Thus, we used the machine learning ‘mlr’ R package that supplies tools to train several different models and to perform tuning for hyperparameters. Using this tools we can view each hyperparameter impact on the performance of the model.In rpart decision tree library, we control the parameters using the rpart.control() function.
train_df<-as.data.frame(train)
test_df<-as.data.frame(test)
model.mlr <- makeClassifTask(
data=train_df,
target="dep_delay"
)
# Search Parameter for Max Depth
param_grid <- makeParamSet(
makeDiscreteParam("maxdepth", values=seq(20,30, by=3)),
makeDiscreteParam("cp", values=seq(0.0001, 0.0006, by=0.0002)),
makeDiscreteParam("minsplit", values=seq(15,25, by=3))
)
# Define Grid
control_grid = makeTuneControlGrid()
# Define Cross Validation
resample = makeResampleDesc("CV", iters = 3L)
# Define Measure
measure = acc
#set.seed(123)
dt_tuneparam <- tuneParams(learner='classif.rpart',
task=model.mlr,
resampling = resample,
measures = measure,
par.set=param_grid,
control=control_grid,
show.info = TRUE)
dt_tuneparam
## Tune result:
## Op. pars: maxdepth=29; cp=1e-04; minsplit=18
## acc.test.mean=0.7090479
result_hyperparam <- generateHyperParsEffectData(dt_tuneparam, partial.dep = TRUE)
Plot of mean accuracy test as a function of cp parameter during tuning parameters
ggplot(
data = result_hyperparam$data,
aes(x = cp, y=acc.test.mean)
) + geom_line(color = 'darkblue') +labs(title="Test accuracy mean per complexity parameter")+
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
best_parameters_multi = setHyperPars(
makeLearner("classif.rpart", predict.type = "prob"),
par.vals = dt_tuneparam$x
)
result_hyperparam.multi <- generateHyperParsEffectData(dt_tuneparam, partial.dep = TRUE)
Plot of accuracy per parameters’ values according to tuning parameters process
full_sample_param<-result_hyperparam.multi$data
result_sample <- result_hyperparam.multi$data %>%
sample_n(nrow(full_sample_param))
hyperparam.plot <- plot_ly(result_sample,
x = ~cp,
y = ~maxdepth,
z = ~minsplit,
marker = list(color = ~acc.test.mean, colorscale = list(c(0, 1), c("darkred", "darkgreen")), showscale = TRUE))
hyperparam.plot <- hyperparam.plot %>% add_markers() %>% layout(title="Tuning hyperparameters results")
hyperparam.plot
Run classification decision tree on train set with best ‘max depth’ parameter value and ‘min split’ parameter value. We chose cp=0.0004 instead of cp=0.0001 despite it outputs worse accuracy test, because cp=0.0004 enables visualization and better interpretation of important predicting variables. Ideally we would like to choose cp=0.0001 which gives higher accuracy on validation data set as observed during tuning parameters. overfitting????
# build model according to best parameters
class_best_param = rpart(dep_delay~ ., data=train, method = 'class', control = c(cp=0.0004, maxdepth = dt_tuneparam$x$maxdepth, minsplit = dt_tuneparam$x$minsplit))
Visualization of the tree:
split.fun <- function(x, labs, digits, varlen, faclen)
{
# replace commas with spaces (needed for strwrap)
labs <- gsub(",", " ", labs)
for(i in 1:length(labs)) {
# split labs[i] into multiple lines
labs[i] <- paste(strwrap(labs[i], width = 15), collapse = "\n")
}
labs
}
rpart.plot(class_best_param, split.fun = split.fun)
Prediction on test data set:
predicted_class<- predict(class_best_param, test, type = 'class')
Calculate a confusion matrix which shows the number of correct negative predictions, the number of incorrect positive predictions, the number of incorrect negative prediction, and the number of correct positive predictions (Visa, Ramsay, Ralescu, & Van Der Knaap, 2011).
confusion_class <- confusionMatrix(predicted_class, test$dep_delay)
confusion_class
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 15330 7220
## 1 6145 14440
##
## Accuracy : 0.6902
## 95% CI : (0.6858, 0.6945)
## No Information Rate : 0.5021
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3804
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7139
## Specificity : 0.6667
## Pos Pred Value : 0.6798
## Neg Pred Value : 0.7015
## Prevalence : 0.4979
## Detection Rate : 0.3554
## Detection Prevalence : 0.5228
## Balanced Accuracy : 0.6903
##
## 'Positive' Class : 0
##
We would like to estimate the accuracy of test data set - a metric used in classification problems to tell the percentage of accurate predictions. Accuracy is calculated by dividing the number of correct predictions (true positive and true negative) by the total number of samples obtained from confusion matrix.
accuracy_num_class<-accuracy(test$dep_delay, predicted_class)
accuracy_num_class
## [1] 0.6901588
We can see from the low test accuracy score shown above that our model failed to predict successfully delayed flights.
As mentioned earlier, we chose 20 minutes as a threshold to determine if a flight is considered delayed. We generated two additional datasets: one with a higher threshold (25 minutes) and one with a lower threshold (15 minutes) in order to test the sensitivity of our predictions. Performing our model on these datasets:
##sensitivity check
#threshold = 15
class_15 <- rpart(dep_delay~ ., data=train_15, method = 'class', control = c(cp=0.0004, maxdepth = dt_tuneparam$x$maxdepth, minsplit = dt_tuneparam$x$minsplit)) #run classification decision tree
pred_class_15 <- predict(class_15, test_15, type = 'class') #prediction
confusion_class_15 <- confusionMatrix(pred_class_15, test_15$dep_delay)
confusion_class_15
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 15330 7220
## 1 6145 14440
##
## Accuracy : 0.6902
## 95% CI : (0.6858, 0.6945)
## No Information Rate : 0.5021
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3804
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7139
## Specificity : 0.6667
## Pos Pred Value : 0.6798
## Neg Pred Value : 0.7015
## Prevalence : 0.4979
## Detection Rate : 0.3554
## Detection Prevalence : 0.5228
## Balanced Accuracy : 0.6903
##
## 'Positive' Class : 0
##
accuracy_num_class_15<-accuracy(test$dep_delay, pred_class_15)
#threshold = 25
class_25 <- rpart(dep_delay~ ., data=train_25, method = 'class', control = c(cp=0.0004, maxdepth = dt_tuneparam$x$maxdepth, minsplit = dt_tuneparam$x$minsplit)) #run classification decision tree
pred_class_25 <- predict(class_25, test_25, type = 'class') #prediction
confusion_class_25 <- confusionMatrix(pred_class_25, test_25$dep_delay)
confusion_class_25
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 15330 7220
## 1 6145 14440
##
## Accuracy : 0.6902
## 95% CI : (0.6858, 0.6945)
## No Information Rate : 0.5021
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3804
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7139
## Specificity : 0.6667
## Pos Pred Value : 0.6798
## Neg Pred Value : 0.7015
## Prevalence : 0.4979
## Detection Rate : 0.3554
## Detection Prevalence : 0.5228
## Balanced Accuracy : 0.6903
##
## 'Positive' Class : 0
##
accuracy_num_class_25<-accuracy(test$dep_delay, pred_class_25)
A visualization of all confusion matrices (of datasets based on threshold=20, threshold=15 and threshold=25 defining delayed flights):
Plot of each predictor’s importance in simple classification tree model when the threshold=20:
class_best_param$variable.importance %>%
data.frame() %>%
rownames_to_column(var = "Feature") %>%
rename(Overall = '.') %>%
ggplot(aes(x = fct_reorder(Feature, Overall), y = Overall)) +
geom_pointrange(aes(ymin = 0, ymax = Overall), color = "cadetblue", size = .3) +
theme_minimal() +
coord_flip() +
labs(x = "", y = "", title = "Variable Importance with Simple Classication (20)")+
theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))
Plot of each predictor’s importance in simple classification tree model in case of threshold=15:
class_15$variable.importance %>%
data.frame() %>%
rownames_to_column(var = "Feature") %>%
rename(Overall = '.') %>%
ggplot(aes(x = fct_reorder(Feature, Overall), y = Overall)) +
geom_pointrange(aes(ymin = 0, ymax = Overall), color = "cadetblue", size = .3) +
theme_minimal() +
coord_flip() +
labs(x = "", y = "", title = "Variable Importance with Simple Classication (15)")+
theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))
Plot of each predictor’s importance in simple classification tree model in case of threshold=25:
class_25$variable.importance %>%
data.frame() %>%
rownames_to_column(var = "Feature") %>%
rename(Overall = '.') %>%
ggplot(aes(x = fct_reorder(Feature, Overall), y = Overall)) +
geom_pointrange(aes(ymin = 0, ymax = Overall), color = "cadetblue", size = .3) +
theme_minimal() +
coord_flip() +
labs(x = "", y = "", title = "Variable Importance with Simple Classication (25)")+
theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))
The decision tree algorithm is relatively easy to understand and interpret. However, a single tree seems insufficient for producing effective results. We chose to run a Random Forest algorithm as it leverages the power of multiple decision trees. It does not rely on the feature importance given by a single decision tree but chooses features randomly during the training process. Therefore, the random forest can generalize the data in a better way. This randomized feature selection makes Random Forest much more accurate than a decision tree.
Run Random Forest on the train set
run_random <- function(train, ntrees=1000, depth = NULL) {
fit <- ranger(
dep_delay ~ .,
data = train,
num.trees = ntrees,
importance = 'impurity',
max.depth = depth,
verbose = TRUE,
)
return(fit)
}
rf <- run_random(train)
## Growing trees.. Progress: 23%. Estimated remaining time: 1 minute, 43 seconds.
## Growing trees.. Progress: 47%. Estimated remaining time: 1 minute, 9 seconds.
## Growing trees.. Progress: 71%. Estimated remaining time: 37 seconds.
## Growing trees.. Progress: 93%. Estimated remaining time: 9 seconds.
rf
## Ranger result
##
## Call:
## ranger(dep_delay ~ ., data = train, num.trees = ntrees, importance = "impurity", max.depth = depth, verbose = TRUE, )
##
## Type: Classification
## Number of trees: 1000
## Sample size: 172537
## Number of independent variables: 24
## Mtry: 4
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 7.78 %
We can see that the OOB prediction error is very low, which means the estimated accuracy is high!
Prediction of the test data set:
pred <- predict(rf, test)
confusion <- table(test$dep_delay, pred$predictions)
confusionMatrix(confusion)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 21064 411
## 1 254 21406
##
## Accuracy : 0.9846
## 95% CI : (0.9834, 0.9857)
## No Information Rate : 0.5058
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9692
##
## Mcnemar's Test P-Value : 1.454e-09
##
## Sensitivity : 0.9881
## Specificity : 0.9812
## Pos Pred Value : 0.9809
## Neg Pred Value : 0.9883
## Prevalence : 0.4942
## Detection Rate : 0.4883
## Detection Prevalence : 0.4979
## Balanced Accuracy : 0.9846
##
## 'Positive' Class : 0
##
As we can see we have high accuracy for predicting delay!
Running random forest model followed with sensitivity checks. We applied the current random forest model on a data based on 20 minutes threshold as definition for delayed flight and 15 minutes threshold, as we performed in simple classification decision tree. Results of random forest model on these datasets:
##sensitivity check
#threshold = 15
rf_15 <- run_random(train_15) #run random forest
## Growing trees.. Progress: 31%. Estimated remaining time: 1 minute, 7 seconds.
## Growing trees.. Progress: 60%. Estimated remaining time: 41 seconds.
## Growing trees.. Progress: 88%. Estimated remaining time: 13 seconds.
rf_15
## Ranger result
##
## Call:
## ranger(dep_delay ~ ., data = train, num.trees = ntrees, importance = "impurity", max.depth = depth, verbose = TRUE, )
##
## Type: Classification
## Number of trees: 1000
## Sample size: 172537
## Number of independent variables: 24
## Mtry: 4
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 7.75 %
pred_15 <- predict(rf_15, test_15) #prediction
confusion_15 <- table(test_15$dep_delay, pred_15$predictions)
confusion_15
##
## 0 1
## 0 21061 414
## 1 257 21403
confusionMatrix(confusion_15)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 21061 414
## 1 257 21403
##
## Accuracy : 0.9844
## 95% CI : (0.9832, 0.9856)
## No Information Rate : 0.5058
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9689
##
## Mcnemar's Test P-Value : 1.719e-09
##
## Sensitivity : 0.9879
## Specificity : 0.9810
## Pos Pred Value : 0.9807
## Neg Pred Value : 0.9881
## Prevalence : 0.4942
## Detection Rate : 0.4883
## Detection Prevalence : 0.4979
## Balanced Accuracy : 0.9845
##
## 'Positive' Class : 0
##
#threshold = 25
rf_25 <- run_random(train_25) #run random forest
## Growing trees.. Progress: 16%. Estimated remaining time: 2 minutes, 43 seconds.
## Growing trees.. Progress: 32%. Estimated remaining time: 2 minutes, 14 seconds.
## Growing trees.. Progress: 49%. Estimated remaining time: 1 minute, 35 seconds.
## Growing trees.. Progress: 67%. Estimated remaining time: 1 minute, 0 seconds.
## Growing trees.. Progress: 85%. Estimated remaining time: 26 seconds.
## Growing trees.. Progress: 97%. Estimated remaining time: 5 seconds.
rf_25
## Ranger result
##
## Call:
## ranger(dep_delay ~ ., data = train, num.trees = ntrees, importance = "impurity", max.depth = depth, verbose = TRUE, )
##
## Type: Classification
## Number of trees: 1000
## Sample size: 172537
## Number of independent variables: 24
## Mtry: 4
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 7.78 %
pred_25 <- predict(rf_25, test_25) #prediction
confusion_25 <- table(test_25$dep_delay, pred_25$predictions)
confusion_25
##
## 0 1
## 0 21061 414
## 1 251 21409
confusionMatrix(confusion_25)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 21061 414
## 1 251 21409
##
## Accuracy : 0.9846
## 95% CI : (0.9834, 0.9857)
## No Information Rate : 0.5059
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9692
##
## Mcnemar's Test P-Value : 3.341e-10
##
## Sensitivity : 0.9882
## Specificity : 0.9810
## Pos Pred Value : 0.9807
## Neg Pred Value : 0.9884
## Prevalence : 0.4941
## Detection Rate : 0.4883
## Detection Prevalence : 0.4979
## Balanced Accuracy : 0.9846
##
## 'Positive' Class : 0
##
A visualization of the confusion matrices:
Lastly, we produced the importance of each predictor:
get_importance <- function(rf) {
imps <- data.frame(
var = names(train)[-5],
imps = rf$variable.importance / max(rf$variable.importance)
)
return(imps)
}
imps <- get_importance(rf)
imps_15 <- get_importance(rf_15)
imps_25 <- get_importance(rf_25)
Plot of predictors’ importance:
Check for overfit with twice the number of trees and a limited depth for the trees:
rf_check <- run_random(train, ntrees = 2000, depth = 20)
## Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 13 seconds.
## Growing trees.. Progress: 38%. Estimated remaining time: 1 minute, 40 seconds.
## Growing trees.. Progress: 57%. Estimated remaining time: 1 minute, 9 seconds.
## Growing trees.. Progress: 76%. Estimated remaining time: 38 seconds.
## Growing trees.. Progress: 93%. Estimated remaining time: 12 seconds.
pred_check <- predict(rf_check, test)
confusion_check <- table(test$dep_delay, pred_check$predictions)
confusionMatrix(confusion_check)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 20361 1114
## 1 980 20680
##
## Accuracy : 0.9515
## 95% CI : (0.9494, 0.9535)
## No Information Rate : 0.5053
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9029
##
## Mcnemar's Test P-Value : 0.003656
##
## Sensitivity : 0.9541
## Specificity : 0.9489
## Pos Pred Value : 0.9481
## Neg Pred Value : 0.9548
## Prevalence : 0.4947
## Detection Rate : 0.4720
## Detection Prevalence : 0.4979
## Balanced Accuracy : 0.9515
##
## 'Positive' Class : 0
##
# plot confusion matrix
plot(
confusion_check,
xlab = 'Predicted',
ylab = 'True',
main = sprintf(
'Predicted vs Obsvered
OOB prediction error: %s',
round(rf_check$prediction.error, 3)
)
)
imps_check <- get_importance(rf_check)
imps_check %>%
ggplot(aes(imps, x = reorder(var, imps))) +
geom_point(size = 3, colour = "#ff6767") +
coord_flip() +
labs(x = "Predictors", y = "Importance scores") +
ggtitle('Importance') +
theme_bw(18)
We pre-processed the data to optimize each parameter’s potential to be helpful in prediction. We used permutation tests to determine which of the levels in each categorical variable with many levels are significant in predicting delays.
We converted the variable dep_delay from a delay in minutes to a binary variable. 0- no delay, 1- there is a delay. The division was made based on a threshold of late minutes. In order to test the sensitivity to this threshold, we created two more datasets, one with a higher threshold and one with a lower threshold, and ran the prediction models on these two datasets. In both models, decision tree and random forest, there were no differences between the primary dataset and the datasets created for sensitivity testing.
In the first step, we ran a decision tree model. This model reached an accuracy level of about 70% in delay prediction. In the second step, we wanted to check whether a Random Forest model could learn the data better and give a more accurate prediction. We thought this model would be better because random forests are much more robust than a single decision tree. They aggregate many decision trees to limit overfitting and error due to bias and yield valuable results.
The Random Forest model reached a much higher level of accuracy than the decision tree, 98%. In principle, the chance of an overfit in Random Forest is low thanks to bagging and random feature selection; they are trained independently on different subsets of the training data. However, the high accuracy score raises our concern that the model may overfit the data. In order to examine the matter, we performed another run of the Random Forest model where we limited the maximum tree depth and increased the number of trees (from 1000 to 2000). The result of this run showed 94% accuracy, which is still high but is unlikely to be overfitted to the large number of trees and the limitation of the trees’ depth. Further, the model is highly accurate for the test as well, so it may just be excellent at predicting. Another possible explanation is that there is indeed an overfit resulting from the fact that the training and the test are similar since they were taken from the same data set.
In conclusion, we have shown that it is possible to predict if there will be a delay in takeoff with a high probability using a Random Forest model that considers various parameters related to weather, aircraft characteristics, airport, destination, airlines, etc. For further research, it is worthwhile to implement the model on current databases for two reasons; The first is to verify the model and see that it is not adjusted only for 2013. Second, since the Corona epidemic, the world of flights has changed, for example the airlines have changed partly due to bankruptcies of low-cost companies, and the loads at the airport have changed as well.
On the relevance of data science for flight delay research: a systematic review Carvalho L, Sternberg A, Maia Gonçalves L, Beatriz Cruz A, Soares JA et al. Transp. Rev., 2021
A Review on Flight Delay Prediction Sternberg A, Soares J, Carvalho D, Ogasawara E arXiv [cs.CY], 2017
Characterization and prediction of air traffic delays Rebollo JJ, Balakrishnan H Transp. Res. Part C: Emerg. Technol., 2014
Visa, S., Ramsay, B., Ralescu, A. L., & Van Der Knaap, E. (2011). Confusion matrix-based feature selection. MAICS, 710(1), 120-127.